* [PATCH 0/3] xfs: avoid IO issues with unaligned memory allocation
@ 2019-08-21  8:38 Dave Chinner
  2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-21  8:38 UTC
  To: linux-xfs

Hi folks,

These patches fix the issue reported by Vishal on 5.3-rc1 with pmem,
reproduced on brd, and previously seen with xenblk.

The core issue is that when memory allocation debugging is turned
on, heap memory is no longer naturally aligned, and that means we
attach large memory regions with random alignment to bios. Some
block drivers only support 512 byte aligned memory buffers, and
these then silently break when fed unaligned buffers.

This happens silently because nothing in the block or driver layers
actually validates that memory buffers containing kernel memory
are correctly aligned. Buffers may be bounced in the drivers if
they are unaligned, but not all drivers support this, hence the
breakage.
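
To illustrate the underlying problem (a minimal sketch, not code from
the patches - kmalloc()'s natural alignment only holds while the slab
adds no debug metadata to heap objects):

	void	*p = kmalloc(512, GFP_KERNEL);

	/* Fine on production kernels; can fire with KASAN/redzones on. */
	WARN_ON((unsigned long)p & 511);
	kfree(p);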

This patchset adds memory allocation tracing, 512 byte aligned
allocation support via kmem_alloc_io(), and a re-implementation of
bio_add_page() that adds memory buffer alignment verification. The
last patch causes unaligned allocations to fail fast, noisily and
safely, such that any developer running KASAN will immediately
notice if XFS attaches an unaligned memory buffer to a bio....

Cheers,

Dave.




* [PATCH 1/3] xfs: add kmem allocation trace points
  2019-08-21  8:38 [PATCH 0/3] xfs: avoid IO issues with unaligned memory allocation Dave Chinner
@ 2019-08-21  8:38 ` Dave Chinner
  2019-08-21 13:34   ` Brian Foster
  2019-08-21 23:20   ` Christoph Hellwig
  2019-08-21  8:38 ` [PATCH 2/3] xfs: add kmem_alloc_io() Dave Chinner
  2019-08-21  8:38 ` [PATCH 3/3] xfs: alignment check bio buffers Dave Chinner
  2 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-21  8:38 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When trying to correlate XFS kernel allocations to memory reclaim
behaviour, it is useful to know what allocations XFS is actually
attempting. This information is not directly available from the
generic memory allocation and reclaim tracepoints, so these new
trace points provide a high level
indication of what the XFS memory demand actually is.

There is no per-filesystem context in this code, so we just trace
the type of allocation, the size and the allocation constraints.
The kmem code also doesn't include much of the common XFS headers,
so there are a few definitions that need to be added to the trace
headers and a couple of types that need to be made common to avoid
needing to include the whole world in the kmem code.
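
As an example of usage, the events land in the "xfs" trace group, so
(assuming tracefs under the default debugfs mount point) they can be
enabled and watched with:

# echo 1 > /sys/kernel/debug/tracing/events/xfs/kmem_alloc/enable
# cat /sys/kernel/debug/tracing/trace_pipe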

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/kmem.c             | 11 +++++++++--
 fs/xfs/libxfs/xfs_types.h |  8 ++++++++
 fs/xfs/xfs_mount.h        |  7 -------
 fs/xfs/xfs_trace.h        | 33 +++++++++++++++++++++++++++++++++
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 16bb9a328678..edcf393c8fd9 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -3,10 +3,10 @@
  * Copyright (c) 2000-2005 Silicon Graphics, Inc.
  * All Rights Reserved.
  */
-#include <linux/sched/mm.h>
+#include "xfs.h"
 #include <linux/backing-dev.h>
-#include "kmem.h"
 #include "xfs_message.h"
+#include "xfs_trace.h"
 
 void *
 kmem_alloc(size_t size, xfs_km_flags_t flags)
@@ -15,6 +15,8 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 	gfp_t	lflags = kmem_flags_convert(flags);
 	void	*ptr;
 
+	trace_kmem_alloc(size, flags, _RET_IP_);
+
 	do {
 		ptr = kmalloc(size, lflags);
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
@@ -35,6 +37,8 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
 	void	*ptr;
 	gfp_t	lflags;
 
+	trace_kmem_alloc_large(size, flags, _RET_IP_);
+
 	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
 	if (ptr)
 		return ptr;
@@ -65,6 +69,8 @@ kmem_realloc(const void *old, size_t newsize, xfs_km_flags_t flags)
 	gfp_t	lflags = kmem_flags_convert(flags);
 	void	*ptr;
 
+	trace_kmem_realloc(newsize, flags, _RET_IP_);
+
 	do {
 		ptr = krealloc(old, newsize, lflags);
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
@@ -85,6 +91,7 @@ kmem_zone_alloc(kmem_zone_t *zone, xfs_km_flags_t flags)
 	gfp_t	lflags = kmem_flags_convert(flags);
 	void	*ptr;
 
+	trace_kmem_zone_alloc(0, flags, _RET_IP_);
 	do {
 		ptr = kmem_cache_alloc(zone, lflags);
 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 802b34cd10fe..300b3e91ca3a 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -169,6 +169,14 @@ typedef struct xfs_bmbt_irec
 	xfs_exntst_t	br_state;	/* extent state */
 } xfs_bmbt_irec_t;
 
+/* per-AG block reservation types */
+enum xfs_ag_resv_type {
+	XFS_AG_RESV_NONE = 0,
+	XFS_AG_RESV_AGFL,
+	XFS_AG_RESV_METADATA,
+	XFS_AG_RESV_RMAPBT,
+};
+
 /*
  * Type verifier functions
  */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4adb6837439a..fdb60e09a9c5 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -327,13 +327,6 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
 }
 
 /* per-AG block reservation data structures*/
-enum xfs_ag_resv_type {
-	XFS_AG_RESV_NONE = 0,
-	XFS_AG_RESV_AGFL,
-	XFS_AG_RESV_METADATA,
-	XFS_AG_RESV_RMAPBT,
-};
-
 struct xfs_ag_resv {
 	/* number of blocks originally reserved here */
 	xfs_extlen_t			ar_orig_reserved;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8094b1920eef..8bb8b4704a00 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -23,6 +23,7 @@ struct xlog;
 struct xlog_ticket;
 struct xlog_recover;
 struct xlog_recover_item;
+struct xlog_rec_header;
 struct xfs_buf_log_format;
 struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
@@ -30,6 +31,10 @@ struct xfs_btree_cur;
 struct xfs_refcount_irec;
 struct xfs_fsmap;
 struct xfs_rmap_irec;
+struct xfs_icreate_log;
+struct xfs_owner_info;
+struct xfs_trans_res;
+struct xfs_inobt_rec_incore;
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
@@ -3575,6 +3580,34 @@ TRACE_EVENT(xfs_pwork_init,
 		  __entry->nr_threads, __entry->pid)
 )
 
+DECLARE_EVENT_CLASS(xfs_kmem_class,
+	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip),
+	TP_ARGS(size, flags, caller_ip),
+	TP_STRUCT__entry(
+		__field(ssize_t, size)
+		__field(int, flags)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->size = size;
+		__entry->flags = flags;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("size %zd flags 0x%x caller %pS",
+		  __entry->size,
+		  __entry->flags,
+		  (char *)__entry->caller_ip)
+)
+
+#define DEFINE_KMEM_EVENT(name) \
+DEFINE_EVENT(xfs_kmem_class, name, \
+	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip), \
+	TP_ARGS(size, flags, caller_ip))
+DEFINE_KMEM_EVENT(kmem_alloc);
+DEFINE_KMEM_EVENT(kmem_alloc_large);
+DEFINE_KMEM_EVENT(kmem_realloc);
+DEFINE_KMEM_EVENT(kmem_zone_alloc);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.23.0.rc1



* [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21  8:38 [PATCH 0/3] xfs: avoid IO issues with unaligned memory allocation Dave Chinner
  2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
@ 2019-08-21  8:38 ` Dave Chinner
  2019-08-21 13:35   ` Brian Foster
  2019-08-21 23:24   ` Christoph Hellwig
  2019-08-21  8:38 ` [PATCH 3/3] xfs: alignment check bio buffers Dave Chinner
  2 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-21  8:38 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Memory we use to submit for IO needs strict alignment to the
underlying driver constraints. Worst case, this is 512 bytes. Given
that all allocations for IO are always a power of 2 multiple of 512
bytes, the kernel heap provides natural alignment for objects of
these sizes and that suffices.

Until, of course, memory debugging of some kind is turned on (e.g.
red zones, poisoning, KASAN) and then the alignment of the heap
objects is thrown out the window. Then we get weird IO errors and
data corruption problems because drivers don't validate alignment
and do the wrong thing when passed unaligned memory buffers in bios.

TO fix this, introduce kmem_alloc_io(), which will guarantee at least
512 byte alignment of buffers for IO, even if memory debugging
options are turned on. It is assumed that the minimum allocation
size will be 512 bytes, and that sizes will be power of 2 multiples
of 512 bytes.

Use this everywhere we allocate buffers for IO.

This no longer fails with log recovery errors when KASAN is enabled
due to the brd driver not handling unaligned memory buffers:

# mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
 fs/xfs/kmem.h            |  1 +
 fs/xfs/xfs_buf.c         |  4 +--
 fs/xfs/xfs_log.c         |  2 +-
 fs/xfs/xfs_log_recover.c |  2 +-
 fs/xfs/xfs_trace.h       |  1 +
 6 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index edcf393c8fd9..ec693c0fdcff 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -30,30 +30,24 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 	} while (1);
 }
 
-void *
-kmem_alloc_large(size_t size, xfs_km_flags_t flags)
+
+/*
+ * __vmalloc() will allocate data pages and auxillary structures (e.g.
+ * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context here. Hence
+ * we need to tell memory reclaim that we are in such a context via
+ * PF_MEMALLOC_NOFS to prevent memory reclaim re-entering the filesystem here
+ * and potentially deadlocking.
+ */
+static void *
+__kmem_vmalloc(size_t size, xfs_km_flags_t flags)
 {
 	unsigned nofs_flag = 0;
 	void	*ptr;
-	gfp_t	lflags;
-
-	trace_kmem_alloc_large(size, flags, _RET_IP_);
-
-	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
-	if (ptr)
-		return ptr;
+	gfp_t	lflags = kmem_flags_convert(flags);
 
-	/*
-	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
-	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
-	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
-	 * the filesystem here and potentially deadlocking.
-	 */
 	if (flags & KM_NOFS)
 		nofs_flag = memalloc_nofs_save();
 
-	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags, PAGE_KERNEL);
 
 	if (flags & KM_NOFS)
@@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
 	return ptr;
 }
 
+/*
+ * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
+ * returned. vmalloc always returns an aligned region.
+ */
+void *
+kmem_alloc_io(size_t size, xfs_km_flags_t flags)
+{
+	void	*ptr;
+
+	trace_kmem_alloc_io(size, flags, _RET_IP_);
+
+	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
+	if (ptr) {
+		if (!((long)ptr & 511))
+			return ptr;
+		kfree(ptr);
+	}
+	return __kmem_vmalloc(size, flags);
+}
+
+void *
+kmem_alloc_large(size_t size, xfs_km_flags_t flags)
+{
+	void	*ptr;
+
+	trace_kmem_alloc_large(size, flags, _RET_IP_);
+
+	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
+	if (ptr)
+		return ptr;
+	return __kmem_vmalloc(size, flags);
+}
+
 void *
 kmem_realloc(const void *old, size_t newsize, xfs_km_flags_t flags)
 {
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 267655acd426..423a1fa0fcd6 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -59,6 +59,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 }
 
 extern void *kmem_alloc(size_t, xfs_km_flags_t);
+extern void *kmem_alloc_io(size_t, xfs_km_flags_t);
 extern void *kmem_alloc_large(size_t size, xfs_km_flags_t);
 extern void *kmem_realloc(const void *, size_t, xfs_km_flags_t);
 static inline void  kmem_free(const void *ptr)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index ca0849043f54..7bd1f31febfc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -353,7 +353,7 @@ xfs_buf_allocate_memory(
 	 */
 	size = BBTOB(bp->b_length);
 	if (size < PAGE_SIZE) {
-		bp->b_addr = kmem_alloc(size, KM_NOFS);
+		bp->b_addr = kmem_alloc_io(size, KM_NOFS);
 		if (!bp->b_addr) {
 			/* low memory - use alloc_page loop instead */
 			goto use_alloc_page;
@@ -368,7 +368,7 @@ xfs_buf_allocate_memory(
 		}
 		bp->b_offset = offset_in_page(bp->b_addr);
 		bp->b_pages = bp->b_page_array;
-		bp->b_pages[0] = virt_to_page(bp->b_addr);
+		bp->b_pages[0] = kmem_to_page(bp->b_addr);
 		bp->b_page_count = 1;
 		bp->b_flags |= _XBF_KMEM;
 		return 0;
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7fc3c1ad36bc..1830d185d7fc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1415,7 +1415,7 @@ xlog_alloc_log(
 		iclog->ic_prev = prev_iclog;
 		prev_iclog = iclog;
 
-		iclog->ic_data = kmem_alloc_large(log->l_iclog_size,
+		iclog->ic_data = kmem_alloc_io(log->l_iclog_size,
 				KM_MAYFAIL);
 		if (!iclog->ic_data)
 			goto out_free_iclog;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 13d1d3e95b88..b4a6a008986b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -125,7 +125,7 @@ xlog_alloc_buffer(
 	if (nbblks > 1 && log->l_sectBBsize > 1)
 		nbblks += log->l_sectBBsize;
 	nbblks = round_up(nbblks, log->l_sectBBsize);
-	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
+	return kmem_alloc_io(BBTOB(nbblks), KM_MAYFAIL);
 }
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8bb8b4704a00..eaae275ed430 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3604,6 +3604,7 @@ DEFINE_EVENT(xfs_kmem_class, name, \
 	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip), \
 	TP_ARGS(size, flags, caller_ip))
 DEFINE_KMEM_EVENT(kmem_alloc);
+DEFINE_KMEM_EVENT(kmem_alloc_io);
 DEFINE_KMEM_EVENT(kmem_alloc_large);
 DEFINE_KMEM_EVENT(kmem_realloc);
 DEFINE_KMEM_EVENT(kmem_zone_alloc);
-- 
2.23.0.rc1



* [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21  8:38 [PATCH 0/3] xfs: avoid IO issues with unaligned memory allocation Dave Chinner
  2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
  2019-08-21  8:38 ` [PATCH 2/3] xfs: add kmem_alloc_io() Dave Chinner
@ 2019-08-21  8:38 ` Dave Chinner
  2019-08-21 13:39   ` Brian Foster
  2019-08-21 23:29   ` Christoph Hellwig
  2 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-21  8:38 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Add memory buffer alignment validation checks to bios built in XFS
to catch bugs that will result in silent data corruption in block
drivers that cannot handle unaligned memory buffers but don't
validate the incoming buffer alignment is correct.

Known drivers with these issues are xenblk, brd and pmem.

Despite there being nothing XFS specific to xfs_bio_add_page(), this
function was created to do the required validation because the block
layer developers keep telling us that it is not possible to
validate buffer alignment in bio_add_page(), and even if it were
possible it would be too much overhead to do at runtime.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.c    |  2 +-
 fs/xfs/xfs_linux.h  |  3 +++
 fs/xfs/xfs_log.c    |  6 +++++-
 4 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index e2148f2d5d6b..fbaea643c000 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
 }
 
+int
+xfs_bio_add_page(
+	struct bio	*bio,
+	struct page	*page,
+	unsigned int	len,
+	unsigned int	offset)
+{
+	struct request_queue	*q = bio->bi_disk->queue;
+	bool		same_page = false;
+
+	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
+		return -EIO;
+
+	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+		if (bio_full(bio, len))
+			return 0;
+		__bio_add_page(bio, page, len, offset);
+	}
+	return len;
+}
+
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
@@ -20,7 +41,7 @@ xfs_rw_bdev(
 {
 	unsigned int		is_vmalloc = is_vmalloc_addr(data);
 	unsigned int		left = count;
-	int			error;
+	int			error, ret = 0;
 	struct bio		*bio;
 
 	if (is_vmalloc && op == REQ_OP_WRITE)
@@ -36,9 +57,12 @@ xfs_rw_bdev(
 		unsigned int	off = offset_in_page(data);
 		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
 
-		while (bio_add_page(bio, page, len, off) != len) {
+		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
 			struct bio	*prev = bio;
 
+			if (ret < 0)
+				goto submit;
+
 			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
 			bio_copy_dev(bio, prev);
 			bio->bi_iter.bi_sector = bio_end_sector(prev);
@@ -48,14 +72,16 @@ xfs_rw_bdev(
 			submit_bio(prev);
 		}
 
+		ret = 0;
 		data += len;
 		left -= len;
 	} while (left > 0);
 
+submit:
 	error = submit_bio_wait(bio);
 	bio_put(bio);
 
 	if (is_vmalloc && op == REQ_OP_READ)
 		invalidate_kernel_vmap_range(data, count);
-	return error;
+	return ret ? ret : error;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7bd1f31febfc..a2d499baee9c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index ca15105681ca..e71c7bf3e714 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -145,6 +145,9 @@ static inline void delay(long ticks)
 	schedule_timeout_uninterruptible(ticks);
 }
 
+int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len,
+			unsigned int offset);
+
 /*
  * XFS wrapper structure for sysfs support. It depends on external data
  * structures and is embedded in various internal data structures to implement
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 1830d185d7fc..52f7d840d09e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1673,8 +1673,12 @@ xlog_map_iclog_data(
 		struct page	*page = kmem_to_page(data);
 		unsigned int	off = offset_in_page(data);
 		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
+		int		ret;
 
-		WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len);
+		ret = xfs_bio_add_page(bio, page, len, off);
+		WARN_ON_ONCE(ret != len);
+		if (ret < 0)
+			break;
 
 		data += len;
 		count -= len;
-- 
2.23.0.rc1



* Re: [PATCH 1/3] xfs: add kmem allocation trace points
  2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
@ 2019-08-21 13:34   ` Brian Foster
  2019-08-21 23:20   ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Brian Foster @ 2019-08-21 13:34 UTC
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 06:38:18PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When trying to correlate XFS kernel allocations to memory reclaim
> behaviour, it is useful to know what allocations XFS is actually
> attempting. This information is not directly available from the
> generic memory allocation and reclaim tracepoints, so these new
> trace points provide a high level
> indication of what the XFS memory demand actually is.
> 
> There is no per-filesystem context in this code, so we just trace
> the type of allocation, the size and the allocation constraints.
> The kmem code also doesn't include much of the common XFS headers,
> so there are a few definitions that need to be added to the trace
> headers and a couple of types that need to be made common to avoid
> needing to include the whole world in the kmem code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/kmem.c             | 11 +++++++++--
>  fs/xfs/libxfs/xfs_types.h |  8 ++++++++
>  fs/xfs/xfs_mount.h        |  7 -------
>  fs/xfs/xfs_trace.h        | 33 +++++++++++++++++++++++++++++++++
>  4 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 16bb9a328678..edcf393c8fd9 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
...
> @@ -85,6 +91,7 @@ kmem_zone_alloc(kmem_zone_t *zone, xfs_km_flags_t flags)
>  	gfp_t	lflags = kmem_flags_convert(flags);
>  	void	*ptr;
>  
> +	trace_kmem_zone_alloc(0, flags, _RET_IP_);

You can use kmem_cache_size() to determine the object size here,
i.e. something like this (untested sketch):
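
	trace_kmem_zone_alloc(kmem_cache_size(zone), flags, _RET_IP_);

With that fixed: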

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  	do {
>  		ptr = kmem_cache_alloc(zone, lflags);
>  		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 802b34cd10fe..300b3e91ca3a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -169,6 +169,14 @@ typedef struct xfs_bmbt_irec
>  	xfs_exntst_t	br_state;	/* extent state */
>  } xfs_bmbt_irec_t;
>  
> +/* per-AG block reservation types */
> +enum xfs_ag_resv_type {
> +	XFS_AG_RESV_NONE = 0,
> +	XFS_AG_RESV_AGFL,
> +	XFS_AG_RESV_METADATA,
> +	XFS_AG_RESV_RMAPBT,
> +};
> +
>  /*
>   * Type verifier functions
>   */
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 4adb6837439a..fdb60e09a9c5 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -327,13 +327,6 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
>  }
>  
>  /* per-AG block reservation data structures*/
> -enum xfs_ag_resv_type {
> -	XFS_AG_RESV_NONE = 0,
> -	XFS_AG_RESV_AGFL,
> -	XFS_AG_RESV_METADATA,
> -	XFS_AG_RESV_RMAPBT,
> -};
> -
>  struct xfs_ag_resv {
>  	/* number of blocks originally reserved here */
>  	xfs_extlen_t			ar_orig_reserved;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 8094b1920eef..8bb8b4704a00 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -23,6 +23,7 @@ struct xlog;
>  struct xlog_ticket;
>  struct xlog_recover;
>  struct xlog_recover_item;
> +struct xlog_rec_header;
>  struct xfs_buf_log_format;
>  struct xfs_inode_log_format;
>  struct xfs_bmbt_irec;
> @@ -30,6 +31,10 @@ struct xfs_btree_cur;
>  struct xfs_refcount_irec;
>  struct xfs_fsmap;
>  struct xfs_rmap_irec;
> +struct xfs_icreate_log;
> +struct xfs_owner_info;
> +struct xfs_trans_res;
> +struct xfs_inobt_rec_incore;
>  
>  DECLARE_EVENT_CLASS(xfs_attr_list_class,
>  	TP_PROTO(struct xfs_attr_list_context *ctx),
> @@ -3575,6 +3580,34 @@ TRACE_EVENT(xfs_pwork_init,
>  		  __entry->nr_threads, __entry->pid)
>  )
>  
> +DECLARE_EVENT_CLASS(xfs_kmem_class,
> +	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip),
> +	TP_ARGS(size, flags, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(ssize_t, size)
> +		__field(int, flags)
> +		__field(unsigned long, caller_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->size = size;
> +		__entry->flags = flags;
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("size %zd flags 0x%x caller %pS",
> +		  __entry->size,
> +		  __entry->flags,
> +		  (char *)__entry->caller_ip)
> +)
> +
> +#define DEFINE_KMEM_EVENT(name) \
> +DEFINE_EVENT(xfs_kmem_class, name, \
> +	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip), \
> +	TP_ARGS(size, flags, caller_ip))
> +DEFINE_KMEM_EVENT(kmem_alloc);
> +DEFINE_KMEM_EVENT(kmem_alloc_large);
> +DEFINE_KMEM_EVENT(kmem_realloc);
> +DEFINE_KMEM_EVENT(kmem_zone_alloc);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> -- 
> 2.23.0.rc1
> 


* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21  8:38 ` [PATCH 2/3] xfs: add kmem_alloc_io() Dave Chinner
@ 2019-08-21 13:35   ` Brian Foster
  2019-08-21 15:08     ` Darrick J. Wong
                       ` (2 more replies)
  2019-08-21 23:24   ` Christoph Hellwig
  1 sibling, 3 replies; 44+ messages in thread
From: Brian Foster @ 2019-08-21 13:35 UTC
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Memory we use to submit for IO needs strict alignment to the
> underlying driver constraints. Worst case, this is 512 bytes. Given
> that all allocations for IO are always a power of 2 multiple of 512
> bytes, the kernel heap provides natural alignment for objects of
> these sizes and that suffices.
> 
> Until, of course, memory debugging of some kind is turned on (e.g.
> red zones, poisoning, KASAN) and then the alignment of the heap
> objects is thrown out the window. Then we get weird IO errors and
> data corruption problems because drivers don't validate alignment
> and do the wrong thing when passed unaligned memory buffers in bios.
> 
> TO fix this, introduce kmem_alloc_io(), which will guarantee at least

s/TO/To/

> 512 byte alignment of buffers for IO, even if memory debugging
> options are turned on. It is assumed that the minimum allocation
> size will be 512 bytes, and that sizes will be power of 2 multiples
> of 512 bytes.
> 
> Use this everywhere we allocate buffers for IO.
> 
> This no longer fails with log recovery errors when KASAN is enabled
> due to the brd driver not handling unaligned memory buffers:
> 
> # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
>  fs/xfs/kmem.h            |  1 +
>  fs/xfs/xfs_buf.c         |  4 +--
>  fs/xfs/xfs_log.c         |  2 +-
>  fs/xfs/xfs_log_recover.c |  2 +-
>  fs/xfs/xfs_trace.h       |  1 +
>  6 files changed, 50 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index edcf393c8fd9..ec693c0fdcff 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
...
> @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
>  	return ptr;
>  }
>  
> +/*
> + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> + * returned. vmalloc always returns an aligned region.
> + */
> +void *
> +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> +{
> +	void	*ptr;
> +
> +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> +
> +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> +	if (ptr) {
> +		if (!((long)ptr & 511))
> +			return ptr;
> +		kfree(ptr);
> +	}
> +	return __kmem_vmalloc(size, flags);
> +}

Even though it is unfortunate, this seems like a quite reasonable and
isolated temporary solution to the problem to me. The one concern I have
is if/how much this could affect performance under certain
circumstances. I realize that these callsites are isolated in the common
scenario. Less common scenarios like sub-page block sizes (whether due
to explicit mkfs time format or default configurations on larger page
size systems) can fall into this path much more frequently, however.

Since this implies some kind of vm debug option is enabled, performance
itself isn't critical when this solution is active. But how bad is it in
those cases where we might depend on this more heavily? Have you
confirmed that the end configuration is still "usable," at least?

I ask because the repeated alloc/free behavior can easily be avoided via
something like an mp flag (which may require a tweak to the
kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
path once we see one unaligned allocation. That assumes this behavior is
tied to functionality that isn't dynamically configured at runtime, of
course.
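
Roughly like the following untested sketch - a global stands in for
the mount flag here since kmem_alloc_io() currently has no mount
context, and the trace point is elided:

	void *
	kmem_alloc_io(size_t size, xfs_km_flags_t flags)
	{
		static bool	skip_kmalloc;	/* hypothetical mp flag */
		void		*ptr;

		if (!READ_ONCE(skip_kmalloc)) {
			ptr = kmem_alloc(size, flags | KM_MAYFAIL);
			if (ptr) {
				if (!((unsigned long)ptr & 511))
					return ptr;
				/* saw an unaligned object; stop trying */
				WRITE_ONCE(skip_kmalloc, true);
				kfree(ptr);
			}
		}
		return __kmem_vmalloc(size, flags);
	}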

Brian

> +
> +void *
> +kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> +{
> +	void	*ptr;
> +
> +	trace_kmem_alloc_large(size, flags, _RET_IP_);
> +
> +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> +	if (ptr)
> +		return ptr;
> +	return __kmem_vmalloc(size, flags);
> +}
> +
>  void *
>  kmem_realloc(const void *old, size_t newsize, xfs_km_flags_t flags)
>  {
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 267655acd426..423a1fa0fcd6 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -59,6 +59,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  }
>  
>  extern void *kmem_alloc(size_t, xfs_km_flags_t);
> +extern void *kmem_alloc_io(size_t, xfs_km_flags_t);
>  extern void *kmem_alloc_large(size_t size, xfs_km_flags_t);
>  extern void *kmem_realloc(const void *, size_t, xfs_km_flags_t);
>  static inline void  kmem_free(const void *ptr)
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index ca0849043f54..7bd1f31febfc 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -353,7 +353,7 @@ xfs_buf_allocate_memory(
>  	 */
>  	size = BBTOB(bp->b_length);
>  	if (size < PAGE_SIZE) {
> -		bp->b_addr = kmem_alloc(size, KM_NOFS);
> +		bp->b_addr = kmem_alloc_io(size, KM_NOFS);
>  		if (!bp->b_addr) {
>  			/* low memory - use alloc_page loop instead */
>  			goto use_alloc_page;
> @@ -368,7 +368,7 @@ xfs_buf_allocate_memory(
>  		}
>  		bp->b_offset = offset_in_page(bp->b_addr);
>  		bp->b_pages = bp->b_page_array;
> -		bp->b_pages[0] = virt_to_page(bp->b_addr);
> +		bp->b_pages[0] = kmem_to_page(bp->b_addr);
>  		bp->b_page_count = 1;
>  		bp->b_flags |= _XBF_KMEM;
>  		return 0;
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 7fc3c1ad36bc..1830d185d7fc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1415,7 +1415,7 @@ xlog_alloc_log(
>  		iclog->ic_prev = prev_iclog;
>  		prev_iclog = iclog;
>  
> -		iclog->ic_data = kmem_alloc_large(log->l_iclog_size,
> +		iclog->ic_data = kmem_alloc_io(log->l_iclog_size,
>  				KM_MAYFAIL);
>  		if (!iclog->ic_data)
>  			goto out_free_iclog;
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..b4a6a008986b 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return kmem_alloc_io(BBTOB(nbblks), KM_MAYFAIL);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 8bb8b4704a00..eaae275ed430 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3604,6 +3604,7 @@ DEFINE_EVENT(xfs_kmem_class, name, \
>  	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip), \
>  	TP_ARGS(size, flags, caller_ip))
>  DEFINE_KMEM_EVENT(kmem_alloc);
> +DEFINE_KMEM_EVENT(kmem_alloc_io);
>  DEFINE_KMEM_EVENT(kmem_alloc_large);
>  DEFINE_KMEM_EVENT(kmem_realloc);
>  DEFINE_KMEM_EVENT(kmem_zone_alloc);
> -- 
> 2.23.0.rc1
> 


* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21  8:38 ` [PATCH 3/3] xfs: alignment check bio buffers Dave Chinner
@ 2019-08-21 13:39   ` Brian Foster
  2019-08-21 21:39     ` Dave Chinner
  2019-08-21 23:30     ` Christoph Hellwig
  2019-08-21 23:29   ` Christoph Hellwig
  1 sibling, 2 replies; 44+ messages in thread
From: Brian Foster @ 2019-08-21 13:39 UTC
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add memory buffer alignment validation checks to bios built in XFS
> to catch bugs that will result in silent data corruption in block
> drivers that cannot handle unaligned memory buffers but don't
> validate the incoming buffer alignment is correct.
> 
> Known drivers with these issues are xenblk, brd and pmem.
> 
> Despite there being nothing XFS specific to xfs_bio_add_page(), this
> function was created to do the required validation because the block
> layer developers keep telling us that it is not possible to
> validate buffer alignment in bio_add_page(), and even if it were
> possible it would be too much overhead to do at runtime.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
>  fs/xfs/xfs_buf.c    |  2 +-
>  fs/xfs/xfs_linux.h  |  3 +++
>  fs/xfs/xfs_log.c    |  6 +++++-
>  4 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index e2148f2d5d6b..fbaea643c000 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
>  }
>  
> +int
> +xfs_bio_add_page(
> +	struct bio	*bio,
> +	struct page	*page,
> +	unsigned int	len,
> +	unsigned int	offset)
> +{
> +	struct request_queue	*q = bio->bi_disk->queue;
> +	bool		same_page = false;
> +
> +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> +		return -EIO;
> +
> +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> +		if (bio_full(bio, len))
> +			return 0;
> +		__bio_add_page(bio, page, len, offset);
> +	}
> +	return len;
> +}
> +

Seems reasonable to me. Looks like bio_add_page() with an error check.
;) Given that, what's the need to open-code bio_add_page() here rather
than just call it after the check?
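
i.e., couldn't it just be something like this (untested)?

	struct request_queue	*q = bio->bi_disk->queue;

	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
		return -EIO;
	return bio_add_page(bio, page, len, offset);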

>  int
>  xfs_rw_bdev(
>  	struct block_device	*bdev,
...
> @@ -36,9 +57,12 @@ xfs_rw_bdev(
>  		unsigned int	off = offset_in_page(data);
>  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
>  
> -		while (bio_add_page(bio, page, len, off) != len) {
> +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
>  			struct bio	*prev = bio;
>  
> +			if (ret < 0)
> +				goto submit;
> +

Hmm.. is submitting the bio really the right thing to do if we get here
and have failed to add any pages to the bio? If we're already seeing
weird behavior for bios with unaligned data memory, this seems like a
recipe for similar weirdness. We'd also end up doing a partial write in
scenarios where we already know we're returning an error. Perhaps we
should create an error path or use a check similar to what is already in
xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
when we already know we're going to return an error) to call bio_endio()
to undo any chaining.

>  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
>  			bio_copy_dev(bio, prev);
>  			bio->bi_iter.bi_sector = bio_end_sector(prev);
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7bd1f31febfc..a2d499baee9c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
>  				      offset);

Similar concern here. The higher level code seems written under the
assumption that bio_add_page() returns success or zero. In this case the
error is basically tossed so we can also attempt to split/chain an empty
bio, or even better, submit a partial write and continue operating as if
nothing happened (outside of the warning). The latter case should be
handled as a log I/O error one way or another.

Brian

>  		if (rbytes < nbytes)
>  			break;
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index ca15105681ca..e71c7bf3e714 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -145,6 +145,9 @@ static inline void delay(long ticks)
>  	schedule_timeout_uninterruptible(ticks);
>  }
>  
> +int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len,
> +			unsigned int offset);
> +
>  /*
>   * XFS wrapper structure for sysfs support. It depends on external data
>   * structures and is embedded in various internal data structures to implement
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1830d185d7fc..52f7d840d09e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1673,8 +1673,12 @@ xlog_map_iclog_data(
>  		struct page	*page = kmem_to_page(data);
>  		unsigned int	off = offset_in_page(data);
>  		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
> +		int		ret;
>  
> -		WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len);
> +		ret = xfs_bio_add_page(bio, page, len, off);
> +		WARN_ON_ONCE(ret != len);
> +		if (ret < 0)
> +			break;
>  
>  		data += len;
>  		count -= len;
> -- 
> 2.23.0.rc1
> 


* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 13:35   ` Brian Foster
@ 2019-08-21 15:08     ` Darrick J. Wong
  2019-08-21 21:24       ` Dave Chinner
  2019-08-21 15:23     ` Eric Sandeen
  2019-08-21 21:14     ` Dave Chinner
  2 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2019-08-21 15:08 UTC
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Memory we use to submit for IO needs strict alignment to the
> > underlying driver constraints. Worst case, this is 512 bytes. Given
> > that all allocations for IO are always a power of 2 multiple of 512
> > bytes, the kernel heap provides natural alignment for objects of
> > these sizes and that suffices.
> > 
> > Until, of course, memory debugging of some kind is turned on (e.g.
> > red zones, poisoning, KASAN) and then the alignment of the heap
> > objects is thrown out the window. Then we get weird IO errors and
> > data corruption problems because drivers don't validate alignment
> > and do the wrong thing when passed unaligned memory buffers in bios.
> > 
> > TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> 
> s/TO/To/
> 
> > 512 byte alignment of buffers for IO, even if memory debugging
> > options are turned on. It is assumed that the minimum allocation
> > size will be 512 bytes, and that sizes will be power of 2 multiples
> > of 512 bytes.
> > 
> > Use this everywhere we allocate buffers for IO.
> > 
> > This no longer fails with log recovery errors when KASAN is enabled
> > due to the brd driver not handling unaligned memory buffers:
> > 
> > # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
> >  fs/xfs/kmem.h            |  1 +
> >  fs/xfs/xfs_buf.c         |  4 +--
> >  fs/xfs/xfs_log.c         |  2 +-
> >  fs/xfs/xfs_log_recover.c |  2 +-
> >  fs/xfs/xfs_trace.h       |  1 +
> >  6 files changed, 50 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index edcf393c8fd9..ec693c0fdcff 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> ...
> > @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> >  	return ptr;
> >  }
> >  
> > +/*
> > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > + * returned. vmalloc always returns an aligned region.
> > + */
> > +void *
> > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > +{
> > +	void	*ptr;
> > +
> > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > +
> > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > +	if (ptr) {
> > +		if (!((long)ptr & 511))

Er... didn't you say "it needs to grab the alignment from
[blk_]queue_dma_alignment(), not use a hard coded value of 511"?

How is this different?  If this buffer is really for IO then shouldn't
we pass in the buftarg or something so that we find the real alignment?
Or check it down in the xfs_buf code that associates a page to a buffer?

Even if all that logic is hidden behind CONFIG_XFS_DEBUG?
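
(A hypothetical sketch of what I mean, with a buftarg argument "btp"
that kmem_alloc_io() doesn't currently take:)

	unsigned long	align_mask =
			queue_dma_alignment(bdev_get_queue(btp->bt_bdev));

	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
	if (ptr) {
		if (!((unsigned long)ptr & align_mask))
			return ptr;
		kfree(ptr);
	}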

> > +			return ptr;
> > +		kfree(ptr);
> > +	}
> > +	return __kmem_vmalloc(size, flags);

How about checking the vmalloc alignment too?  If we're going to be this
paranoid we might as well go all the way. :)

--D

> > +}
> 
> Even though it is unfortunate, this seems like a quite reasonable and
> isolated temporary solution to the problem to me. The one concern I have
> is if/how much this could affect performance under certain
> circumstances. I realize that these callsites are isolated in the common
> scenario. Less common scenarios like sub-page block sizes (whether due
> to explicit mkfs time format or default configurations on larger page
> size systems) can fall into this path much more frequently, however.
> 
> Since this implies some kind of vm debug option is enabled, performance
> itself isn't critical when this solution is active. But how bad is it in
> those cases where we might depend on this more heavily? Have you
> confirmed that the end configuration is still "usable," at least?
> 
> I ask because the repeated alloc/free behavior can easily be avoided via
> something like an mp flag (which may require a tweak to the
> kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> path once we see one unaligned allocation. That assumes this behavior is
> tied to functionality that isn't dynamically configured at runtime, of
> course.
> 
> Brian
> 
> > +
> > +void *
> > +kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> > +{
> > +	void	*ptr;
> > +
> > +	trace_kmem_alloc_large(size, flags, _RET_IP_);
> > +
> > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > +	if (ptr)
> > +		return ptr;
> > +	return __kmem_vmalloc(size, flags);
> > +}
> > +
> >  void *
> >  kmem_realloc(const void *old, size_t newsize, xfs_km_flags_t flags)
> >  {
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index 267655acd426..423a1fa0fcd6 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -59,6 +59,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> >  }
> >  
> >  extern void *kmem_alloc(size_t, xfs_km_flags_t);
> > +extern void *kmem_alloc_io(size_t, xfs_km_flags_t);
> >  extern void *kmem_alloc_large(size_t size, xfs_km_flags_t);
> >  extern void *kmem_realloc(const void *, size_t, xfs_km_flags_t);
> >  static inline void  kmem_free(const void *ptr)
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index ca0849043f54..7bd1f31febfc 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -353,7 +353,7 @@ xfs_buf_allocate_memory(
> >  	 */
> >  	size = BBTOB(bp->b_length);
> >  	if (size < PAGE_SIZE) {
> > -		bp->b_addr = kmem_alloc(size, KM_NOFS);
> > +		bp->b_addr = kmem_alloc_io(size, KM_NOFS);
> >  		if (!bp->b_addr) {
> >  			/* low memory - use alloc_page loop instead */
> >  			goto use_alloc_page;
> > @@ -368,7 +368,7 @@ xfs_buf_allocate_memory(
> >  		}
> >  		bp->b_offset = offset_in_page(bp->b_addr);
> >  		bp->b_pages = bp->b_page_array;
> > -		bp->b_pages[0] = virt_to_page(bp->b_addr);
> > +		bp->b_pages[0] = kmem_to_page(bp->b_addr);
> >  		bp->b_page_count = 1;
> >  		bp->b_flags |= _XBF_KMEM;
> >  		return 0;
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 7fc3c1ad36bc..1830d185d7fc 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -1415,7 +1415,7 @@ xlog_alloc_log(
> >  		iclog->ic_prev = prev_iclog;
> >  		prev_iclog = iclog;
> >  
> > -		iclog->ic_data = kmem_alloc_large(log->l_iclog_size,
> > +		iclog->ic_data = kmem_alloc_io(log->l_iclog_size,
> >  				KM_MAYFAIL);
> >  		if (!iclog->ic_data)
> >  			goto out_free_iclog;
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 13d1d3e95b88..b4a6a008986b 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -125,7 +125,7 @@ xlog_alloc_buffer(
> >  	if (nbblks > 1 && log->l_sectBBsize > 1)
> >  		nbblks += log->l_sectBBsize;
> >  	nbblks = round_up(nbblks, log->l_sectBBsize);
> > -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> > +	return kmem_alloc_io(BBTOB(nbblks), KM_MAYFAIL);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 8bb8b4704a00..eaae275ed430 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3604,6 +3604,7 @@ DEFINE_EVENT(xfs_kmem_class, name, \
> >  	TP_PROTO(ssize_t size, int flags, unsigned long caller_ip), \
> >  	TP_ARGS(size, flags, caller_ip))
> >  DEFINE_KMEM_EVENT(kmem_alloc);
> > +DEFINE_KMEM_EVENT(kmem_alloc_io);
> >  DEFINE_KMEM_EVENT(kmem_alloc_large);
> >  DEFINE_KMEM_EVENT(kmem_realloc);
> >  DEFINE_KMEM_EVENT(kmem_zone_alloc);
> > -- 
> > 2.23.0.rc1
> > 


* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 13:35   ` Brian Foster
  2019-08-21 15:08     ` Darrick J. Wong
@ 2019-08-21 15:23     ` Eric Sandeen
  2019-08-21 21:14     ` Dave Chinner
  2 siblings, 0 replies; 44+ messages in thread
From: Eric Sandeen @ 2019-08-21 15:23 UTC
  To: Brian Foster, Dave Chinner; +Cc: linux-xfs

On 8/21/19 8:35 AM, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Memory we use to submit for IO needs strict alignment to the
>> underlying driver constraints. Worst case, this is 512 bytes. Given
>> that all allocations for IO are always a power of 2 multiple of 512
>> bytes, the kernel heap provides natural alignment for objects of
>> these sizes and that suffices.
>>
>> Until, of course, memory debugging of some kind is turned on (e.g.
>> red zones, poisoning, KASAN) and then the alignment of the heap
>> objects is thrown out the window. Then we get weird IO errors and
>> data corruption problems because drivers don't validate alignment
>> and do the wrong thing when passed unaligned memory buffers in bios.
>>
>> TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> 
> s/TO/To/
> 
>> 512 byte alignment of buffers for IO, even if memory debugging
>> options are turned on. It is assumed that the minimum allocation
>> size will be 512 bytes, and that sizes will be power of 2 multiples
>> of 512 bytes.
>>
>> Use this everywhere we allocate buffers for IO.
>>
>> This no longer fails with log recovery errors when KASAN is enabled
>> due to the brd driver not handling unaligned memory buffers:
>>
>> # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> ---
>>  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
>>  fs/xfs/kmem.h            |  1 +
>>  fs/xfs/xfs_buf.c         |  4 +--
>>  fs/xfs/xfs_log.c         |  2 +-
>>  fs/xfs/xfs_log_recover.c |  2 +-
>>  fs/xfs/xfs_trace.h       |  1 +
>>  6 files changed, 50 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
>> index edcf393c8fd9..ec693c0fdcff 100644
>> --- a/fs/xfs/kmem.c
>> +++ b/fs/xfs/kmem.c
> ...
>> @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
>>  	return ptr;
>>  }
>>  
>> +/*
>> + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
>> + * returned. vmalloc always returns an aligned region.
>> + */
>> +void *
>> +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
>> +{
>> +	void	*ptr;
>> +
>> +	trace_kmem_alloc_io(size, flags, _RET_IP_);
>> +
>> +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
>> +	if (ptr) {
>> +		if (!((long)ptr & 511))
>> +			return ptr;
>> +		kfree(ptr);
>> +	}
>> +	return __kmem_vmalloc(size, flags);
>> +}
> 
> Even though it is unfortunate, this seems like a quite reasonable and
> isolated temporary solution to the problem to me. The one concern I have
> is if/how much this could affect performance under certain
> circumstances. I realize that these callsites are isolated in the common
> scenario. Less common scenarios like sub-page block sizes (whether due
> to explicit mkfs time format or default configurations on larger page
> size systems) can fall into this path much more frequently, however.
> 
> Since this implies some kind of vm debug option is enabled, performance
> itself isn't critical when this solution is active. But how bad is it in
> those cases where we might depend on this more heavily? Have you
> confirmed that the end configuration is still "usable," at least?
> 
> I ask because the repeated alloc/free behavior can easily be avoided via
> something like an mp flag (which may require a tweak to the
> kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> path once we see one unaligned allocation. That assumes this behavior is
> tied to functionality that isn't dynamically configured at runtime, of
> course.

This seems like a good idea to me.

-Eric


* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 13:35   ` Brian Foster
  2019-08-21 15:08     ` Darrick J. Wong
  2019-08-21 15:23     ` Eric Sandeen
@ 2019-08-21 21:14     ` Dave Chinner
  2019-08-22 13:40       ` Brian Foster
  2 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-21 21:14 UTC
  To: Brian Foster; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Memory we use to submit for IO needs strict alignment to the
> > underlying driver constraints. Worst case, this is 512 bytes. Given
> > that all allocations for IO are always a power of 2 multiple of 512
> > bytes, the kernel heap provides natural alignment for objects of
> > these sizes and that suffices.
> > 
> > Until, of course, memory debugging of some kind is turned on (e.g.
> > red zones, poisoning, KASAN) and then the alignment of the heap
> > objects is thrown out the window. Then we get weird IO errors and
> > data corruption problems because drivers don't validate alignment
> > and do the wrong thing when passed unaligned memory buffers in bios.
> > 
> > TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> 
> s/TO/To/
> 
> > 512 byte alignment of buffers for IO, even if memory debugging
> > options are turned on. It is assumed that the minimum allocation
> > size will be 512 bytes, and that sizes will be power of 2 multiples
> > of 512 bytes.
> > 
> > Use this everywhere we allocate buffers for IO.
> > 
> > This no longer fails with log recovery errors when KASAN is enabled
> > due to the brd driver not handling unaligned memory buffers:
> > 
> > # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
> >  fs/xfs/kmem.h            |  1 +
> >  fs/xfs/xfs_buf.c         |  4 +--
> >  fs/xfs/xfs_log.c         |  2 +-
> >  fs/xfs/xfs_log_recover.c |  2 +-
> >  fs/xfs/xfs_trace.h       |  1 +
> >  6 files changed, 50 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index edcf393c8fd9..ec693c0fdcff 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> ...
> > @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> >  	return ptr;
> >  }
> >  
> > +/*
> > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > + * returned. vmalloc always returns an aligned region.
> > + */
> > +void *
> > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > +{
> > +	void	*ptr;
> > +
> > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > +
> > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > +	if (ptr) {
> > +		if (!((long)ptr & 511))
> > +			return ptr;
> > +		kfree(ptr);
> > +	}
> > +	return __kmem_vmalloc(size, flags);
> > +}
> 
> Even though it is unfortunate, this seems like a quite reasonable and
> isolated temporary solution to the problem to me. The one concern I have
> is if/how much this could affect performance under certain
> circumstances.

Can't measure a difference on 4k block size filesystems. It's only
used for log recovery and then for allocation AGF/AGI buffers on
512 byte sector devices. Anything using 4k sectors only hits it
during mount. So for default configs with memory poisoning/KASAN
enabled, the massive overhead of poisoning/tracking makes this
disappear in the noise.

For 1k block size filesystems, it gets hit much harder, but
there's no noticeable increase in runtime of xfstests vs 4k block
size with KASAN enabled. The big increase in overhead comes from
enabling KASAN (takes 3x longer than without), not doing one extra
allocation/free pair.

> I realize that these callsites are isolated in the common
> scenario. Less common scenarios like sub-page block sizes (whether due
> to explicit mkfs time format or default configurations on larger page
> size systems) can fall into this path much more frequently, however.

*nod*

> Since this implies some kind of vm debug option is enabled, performance
> itself isn't critical when this solution is active. But how bad is it in
> those cases where we might depend on this more heavily? Have you
> confirmed that the end configuration is still "usable," at least?

No noticeable difference, most definitely still usable.

> I ask because the repeated alloc/free behavior can easily be avoided via
> something like an mp flag (which may require a tweak to the

What's an "mp flag"?

> kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> path once we see one unaligned allocation. That assumes this behavior is
> tied to functionality that isn't dynamically configured at runtime, of
> course.

vmalloc() has a _lot_ more overhead than kmalloc (especially when
vmalloc has to do multiple allocations itself to set up page table
entries) so there is still an overall gain to be had by trying
kmalloc even if 50% of allocations are unaligned.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 15:08     ` Darrick J. Wong
@ 2019-08-21 21:24       ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-21 21:24 UTC
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Wed, Aug 21, 2019 at 08:08:01AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Memory we use to submit for IO needs strict alignment to the
> > > underlying driver constraints. Worst case, this is 512 bytes. Given
> > > that all allocations for IO are always a power of 2 multiple of 512
> > > bytes, the kernel heap provides natural alignment for objects of
> > > these sizes and that suffices.
> > > 
> > > Until, of course, memory debugging of some kind is turned on (e.g.
> > > red zones, poisoning, KASAN) and then the alignment of the heap
> > > objects is thrown out the window. Then we get weird IO errors and
> > > data corruption problems because drivers don't validate alignment
> > > and do the wrong thing when passed unaligned memory buffers in bios.
> > > 
> > > TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> > 
> > s/TO/To/
> > 
> > > 512 byte alignment of buffers for IO, even if memory debugging
> > > options are turned on. It is assumed that the minimum allocation
> > > size will be 512 bytes, and that sizes will be power of 2 multiples
> > > of 512 bytes.
> > > 
> > > Use this everywhere we allocate buffers for IO.
> > > 
> > > This no longer fails with log recovery errors when KASAN is enabled
> > > due to the brd driver not handling unaligned memory buffers:
> > > 
> > > # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
> > >  fs/xfs/kmem.h            |  1 +
> > >  fs/xfs/xfs_buf.c         |  4 +--
> > >  fs/xfs/xfs_log.c         |  2 +-
> > >  fs/xfs/xfs_log_recover.c |  2 +-
> > >  fs/xfs/xfs_trace.h       |  1 +
> > >  6 files changed, 50 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index edcf393c8fd9..ec693c0fdcff 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > ...
> > > @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> > >  	return ptr;
> > >  }
> > >  
> > > +/*
> > > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > > + * returned. vmalloc always returns an aligned region.
> > > + */
> > > +void *
> > > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > > +{
> > > +	void	*ptr;
> > > +
> > > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > > +
> > > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > > +	if (ptr) {
> > > +		if (!((long)ptr & 511))
> 
> Er... didn't you say "it needs to grab the alignment from
> [blk_]queue_dma_alignment(), not use a hard coded value of 511"?

That's fine for the bio, which has a direct pointer to the request
queue. Here the allocation may be a long way away from the IO
itself, and we might actually be slicing and dicing an allocated
region into a bio and not just the whole region itself.

So I've just taken the worst case - queue_dma_alignment() returns
511 if no queue is supplied, so this is the worst case alignment
that the block device will require. We don't need anything more to fix
the problem right now, and getting alignment into this function in
all cases makes it a bit more complicated...
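
FWIW, queue_dma_alignment() is just this (quoting from memory, so
check blkdev.h before relying on it):

static inline int queue_dma_alignment(struct request_queue *q)
{
	return q ? q->dma_alignment : 511;
}

i.e. with no queue at hand, 511 is the worst case mask we can assume.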

> How is this different?  If this buffer is really for IO then shouldn't
> we pass in the buftarg or something so that we find the real alignment?
> Or check it down in the xfs_buf code that associates a page to a buffer?

No, at worst we should pass in the alignment - there is no good reason to
be passing buftargs, block devices or request queues into memory
allocation APIs. I'll have a look at this.

> Even if all that logic is hidden behind CONFIG_XFS_DEBUG?

That's no good because memory debugging can be turned on without
CONFIG_XFS_DEBUG (how we tripped over the xenblk case).

> 
> > > +			return ptr;
> > > +		kfree(ptr);
> > > +	}
> > > +	return __kmem_vmalloc(size, flags);
> 
> How about checking the vmalloc alignment too?  If we're going to be this
> paranoid we might as well go all the way. :)

If vmalloc is returning unaligned regions, lots of stuff is going to
break everywhere. It has to be page aligned because of the pte
mappings it requires. Memory debugging puts guard pages around
vmalloc, it doesn't change the alignment.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 13:39   ` Brian Foster
@ 2019-08-21 21:39     ` Dave Chinner
  2019-08-22 13:47       ` Brian Foster
  2019-08-21 23:30     ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-21 21:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> > 
> > Known drivers with these issues are xenblk, brd and pmem.
> > 
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
> >  fs/xfs/xfs_buf.c    |  2 +-
> >  fs/xfs/xfs_linux.h  |  3 +++
> >  fs/xfs/xfs_log.c    |  6 +++++-
> >  4 files changed, 38 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > index e2148f2d5d6b..fbaea643c000 100644
> > --- a/fs/xfs/xfs_bio_io.c
> > +++ b/fs/xfs/xfs_bio_io.c
> > @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> >  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
> >  }
> >  
> > +int
> > +xfs_bio_add_page(
> > +	struct bio	*bio,
> > +	struct page	*page,
> > +	unsigned int	len,
> > +	unsigned int	offset)
> > +{
> > +	struct request_queue	*q = bio->bi_disk->queue;
> > +	bool		same_page = false;
> > +
> > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +		return -EIO;
> > +
> > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +		if (bio_full(bio, len))
> > +			return 0;
> > +		__bio_add_page(bio, page, len, offset);
> > +	}
> > +	return len;
> > +}
> > +
> 
> Seems reasonable to me. Looks like bio_add_page() with an error check.
> ;)

It is exactly that.

> Given that, what's the need to open-code bio_add_page() here rather
> than just call it after the check?

Largely because I wasn't sure exactly where the best place was to
add the check. Copy first, hack to pieces, make work, then worry
about other stuff :)

I could change it to calling bio_add_page() directly. Not
really that fussed as it's all exported functionality...
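
i.e. it would collapse down to something like this (completely
untested sketch):

int
xfs_bio_add_page(
	struct bio	*bio,
	struct page	*page,
	unsigned int	len,
	unsigned int	offset)
{
	struct request_queue	*q = bio->bi_disk->queue;

	/* fail fast and noisily on misaligned buffers */
	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
		return -EIO;

	/* alignment is fine, let the generic helper do the work */
	return bio_add_page(bio, page, len, offset);
}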


> >  int
> >  xfs_rw_bdev(
> >  	struct block_device	*bdev,
> ...
> > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> >  		unsigned int	off = offset_in_page(data);
> >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> >  
> > -		while (bio_add_page(bio, page, len, off) != len) {
> > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> >  			struct bio	*prev = bio;
> >  
> > +			if (ret < 0)
> > +				goto submit;
> > +
> 
> Hmm.. is submitting the bio really the right thing to do if we get here
> and have failed to add any pages to the bio?

Yes, because we really should wait for chained bios to complete
before returning. We can only do that by waiting on completion for
the entire chain, and the simplest way to do that is submit the
bio...

> If we're already seeing
> weird behavior for bios with unaligned data memory, this seems like a
> recipe for similar weirdness. We'd also end up doing a partial write in
> scenarios where we already know we're returning an error.

The partial write will happen anyway on a chained bio, something we
know already happens from the bug in bio chaining that was found in
this code earlier in the cycle.

> Perhaps we
> should create an error path or use a check similar to what is already in
> xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> when we already know we're going to return an error) to call bio_endio()
> to undo any chaining.

xfs_buf_ioapply_map() only fails with an error if it occurs on the
first bio_add_page() call to a bio. If it fails on the second, then
it submits the bio, allocates a new bio, and tries again. If that
fails, then it frees the bio and does error processing.

That code can get away with such behaviour because it is not
chaining bios. Each bio is its own IO, and failures there already
result in partial IO completion with an error on the buffer. This
code here will fail with a partial IO completion and return an error.

So, in effect, this code provides exactly the same behaviour and
result as the xfs_buf_ioapply_map() code does...

> 
> >  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
> >  			bio_copy_dev(bio, prev);
> >  			bio->bi_iter.bi_sector = bio_end_sector(prev);
> ...
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 7bd1f31febfc..a2d499baee9c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> >  		if (nbytes > size)
> >  			nbytes = size;
> >  
> > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> >  				      offset);
> 
> Similar concern here. The higher level code seems written under the
> assumption that bio_add_page() returns success or zero. In this case the
> error is basically tossed so we can also attempt to split/chain an empty
> bio, or even better, submit a partial write and continue operating as if
> nothing happened (outside of the warning). The latter case should be
> handled as a log I/O error one way or another.

See above - it does result in a failure when offset/nbytes is not
aligned to the underlying device...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/3] xfs: add kmem allocation trace points
  2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
  2019-08-21 13:34   ` Brian Foster
@ 2019-08-21 23:20   ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-21 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good, especially with the size tweak suggested by Brian:

Reviewed-by: Christoph Hellwig <hch@lst.de>

I'd just split the AG constant move into a separate patch to better
document it, though.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21  8:38 ` [PATCH 2/3] xfs: add kmem_alloc_io() Dave Chinner
  2019-08-21 13:35   ` Brian Foster
@ 2019-08-21 23:24   ` Christoph Hellwig
  2019-08-22  0:31     ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-21 23:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +
> +/*
> + * __vmalloc() will allocate data pages and auxillary structures (e.g.
> + * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context here. Hence
> + * we need to tell memory reclaim that we are in such a context via
> + * PF_MEMALLOC_NOFS to prevent memory reclaim re-entering the filesystem here
> + * and potentially deadlocking.
> + */

Btw, I think we should eventually kill off KM_NOFS and just use
PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
But that's something for the future.
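
i.e. callers would scope the task context once and just use plain
GFP_KERNEL below it. A rough sketch of the pattern, using the
existing scoped-memalloc API:

	unsigned int	nofs_flag;
	void		*ptr;

	/* every allocation in here is implicitly GFP_NOFS */
	nofs_flag = memalloc_nofs_save();
	ptr = kmalloc(size, GFP_KERNEL);
	memalloc_nofs_restore(nofs_flag);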

> +/*
> + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> + * returned. vmalloc always returns an aligned region.
> + */
> +void *
> +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> +{
> +	void	*ptr;
> +
> +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> +
> +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> +	if (ptr) {
> +		if (!((long)ptr & 511))

Please use unsigned long (or uintptr_t if you want to be fancy),
and (SECTOR_SIZE - 1).

As said elsewhere if we want to be fancy we should probably pass a
request queue or something pointing to it.  But then again I don't think
it really matters much, it would save us the reallocation with slub debug
for a bunch of scsi adapters that support dword aligned I/O.  But at
least the interface would be a little more obvious.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21  8:38 ` [PATCH 3/3] xfs: alignment check bio buffers Dave Chinner
  2019-08-21 13:39   ` Brian Foster
@ 2019-08-21 23:29   ` Christoph Hellwig
  2019-08-22  0:37     ` Dave Chinner
  2019-08-22  2:50     ` Ming Lei
  1 sibling, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-21 23:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Jens Axboe, linux-block

On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add memory buffer alignment validation checks to bios built in XFS
> to catch bugs that will result in silent data corruption in block
> drivers that cannot handle unaligned memory buffers but don't
> validate the incoming buffer alignment is correct.
> 
> Known drivers with these issues are xenblk, brd and pmem.
> 
> Despite there being nothing XFS specific to xfs_bio_add_page(), this
> function was created to do the required validation because the block
> layer developers that keep telling us that is not possible to
> validate buffer alignment in bio_add_page(), and even if it was
> possible it would be too much overhead to do at runtime.

I really don't think we should lift this to XFS, but instead fix it
in the block layer.  And that is not only because I have a pending
series lifting bits you are touching to the block layer..

> +int
> +xfs_bio_add_page(
> +	struct bio	*bio,
> +	struct page	*page,
> +	unsigned int	len,
> +	unsigned int	offset)
> +{
> +	struct request_queue	*q = bio->bi_disk->queue;
> +	bool		same_page = false;
> +
> +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> +		return -EIO;
> +
> +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> +		if (bio_full(bio, len))
> +			return 0;
> +		__bio_add_page(bio, page, len, offset);
> +	}
> +	return len;

I know Jens disagrees, but with the amount of bugs we've been hitting
thanks to slub (and I'm pretty sure we have more hiding outside of
XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

Note that all current callers of bio_add_page can only really check
for the return value != the added len anyway, so it is not going to
make anything worse.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 13:39   ` Brian Foster
  2019-08-21 21:39     ` Dave Chinner
@ 2019-08-21 23:30     ` Christoph Hellwig
  2019-08-22  0:44       ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-21 23:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> >  		unsigned int	off = offset_in_page(data);
> >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> >  
> > -		while (bio_add_page(bio, page, len, off) != len) {
> > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> >  			struct bio	*prev = bio;
> >  
> > +			if (ret < 0)
> > +				goto submit;
> > +
> 
> Hmm.. is submitting the bio really the right thing to do if we get here
> and have failed to add any pages to the bio? If we're already seeing
> weird behavior for bios with unaligned data memory, this seems like a
> recipe for similar weirdness. We'd also end up doing a partial write in
> scenarios where we already know we're returning an error. Perhaps we
> should create an error path or use a check similar to what is already in
> xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> when we already know we're going to return an error) to call bio_endio()
> to undo any chaining.

It is not the right thing to do.  Calling bio_endio after setting
an error is the right thing to do (modulo any other cleanup needed).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 23:24   ` Christoph Hellwig
@ 2019-08-22  0:31     ` Dave Chinner
  2019-08-22  7:59       ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22  0:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Aug 21, 2019 at 04:24:40PM -0700, Christoph Hellwig wrote:
> > +
> > +/*
> > + * __vmalloc() will allocate data pages and auxillary structures (e.g.
> > + * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context here. Hence
> > + * we need to tell memory reclaim that we are in such a context via
> > + * PF_MEMALLOC_NOFS to prevent memory reclaim re-entering the filesystem here
> > + * and potentially deadlocking.
> > + */
> 
> Btw, I think we should eventually kill off KM_NOFS and just use
> PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> But that's something for the future.

Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
at high levels - we'll still need to annotate callers that use KM_NOFS
to avoid lockdep false positives. i.e. any code that can be called from
GFP_KERNEL and reclaim context will throw false positives from
lockdep if we don't annotate them correctly....

> > +/*
> > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > + * returned. vmalloc always returns an aligned region.
> > + */
> > +void *
> > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > +{
> > +	void	*ptr;
> > +
> > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > +
> > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > +	if (ptr) {
> > +		if (!((long)ptr & 511))
> 
> Please use unsigned long (or uintptr_t if you want to be fancy),
> and (SECTOR_SIZE - 1).

Already changed it to uintptr_t when I did...

> 
> As said elsewhere if we want to be fancy we should probably pass a
> request queue or something pointing to it.

.... this. Well, not exactly this - I pass in the alignment required
as an int, and the callers get it from the request queue....

> But then again I don't think
> it really matters much, it would save us the reallocation with slub debug
> for a bunch of scsi adapters that support dword aligned I/O.  But last
> least the interface would be a little more obvious.

Yup, just smoke testing it now before I resend.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 23:29   ` Christoph Hellwig
@ 2019-08-22  0:37     ` Dave Chinner
  2019-08-22  8:03       ` Christoph Hellwig
  2019-08-22  2:50     ` Ming Lei
  1 sibling, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22  0:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, Jens Axboe, linux-block

On Wed, Aug 21, 2019 at 04:29:45PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> > 
> > Known drivers with these issues are xenblk, brd and pmem.
> > 
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
> 
> I really don't think we should lift this to XFS, but instead fix it
> in the block layer.  And that is not only because I have a pending
> series lifting bits you are touching to the block layer..

I agree, but....

> 
> > +int
> > +xfs_bio_add_page(
> > +	struct bio	*bio,
> > +	struct page	*page,
> > +	unsigned int	len,
> > +	unsigned int	offset)
> > +{
> > +	struct request_queue	*q = bio->bi_disk->queue;
> > +	bool		same_page = false;
> > +
> > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +		return -EIO;
> > +
> > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +		if (bio_full(bio, len))
> > +			return 0;
> > +		__bio_add_page(bio, page, len, offset);
> > +	}
> > +	return len;
> 
> I know Jens disagrees, but with the amount of bugs we've been hitting
> thanks to slub (and I'm pretty sure we have more hiding outside of
> XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

... I'm not prepared to fight this battle to get this initial fix
into the code. Get the fix merged, then we can 

> Note that all current callers of bio_add_page can only really check
> for the return value != the added len anyway, so it is not going to
> make anything worse.

It does make things worse - it turns multi-bio chaining loops like
the one in xfs_rw_bdev() into an endless loop as they don't make
progress - they just keep allocating a new bio and retrying the same
badly aligned buffer and failing. So if we want an alignment failure
to error out, callers need to handle the failure, not treat it like
a full bio.
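
i.e. the chaining loop has to distinguish the two cases, something
like this (sketch based on the xfs_rw_bdev() loop, untested):

	while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
		struct bio	*prev = bio;

		/* hard error - stop chaining and fail the IO */
		if (ret < 0)
			goto submit;	/* wait out the chain, return ret */

		/* ret == 0 means the bio is full - chain a new one */
		bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
		bio_copy_dev(bio, prev);
		bio->bi_iter.bi_sector = bio_end_sector(prev);
		bio->bi_opf = prev->bi_opf;
		bio_chain(prev, bio);
		submit_bio(prev);
	}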

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 23:30     ` Christoph Hellwig
@ 2019-08-22  0:44       ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-22  0:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Wed, Aug 21, 2019 at 04:30:41PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> > >  		unsigned int	off = offset_in_page(data);
> > >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> > >  
> > > -		while (bio_add_page(bio, page, len, off) != len) {
> > > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> > >  			struct bio	*prev = bio;
> > >  
> > > +			if (ret < 0)
> > > +				goto submit;
> > > +
> > 
> > Hmm.. is submitting the bio really the right thing to do if we get here
> > and have failed to add any pages to the bio? If we're already seeing
> > weird behavior for bios with unaligned data memory, this seems like a
> > recipe for similar weirdness. We'd also end up doing a partial write in
> > scenarios where we already know we're returning an error. Perhaps we
> > should create an error path or use a check similar to what is already in
> > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > when we already know we're going to return an error) to call bio_endio()
> > to undo any chaining.
> 
> It is not the right thing to do.  Calling bio_endio after setting
> an error is the right thing to do (modulo any other cleanup needed).

bio_endio() doesn't wait for completion of all the IO in progress.

In fact, if we have chained bios still in progress, it does
absolutely nothing:

void bio_endio(struct bio *bio)
{
again:
>>>>>   if (!bio_remaining_done(bio))
                return;

and so we return while the memory we've put into the buffers is still
in active use by the chained bios under IO. On error, we'll free the
allocated buffer immediately, and that means we've got a
use-after-free as the bios in progress still have references to it.
If it's heap memory we are using here, then that's a memory
corruption (read) or kernel memory leak (write) vector.

So we have to wait for IO completion before we return from this
function, and AFAICT, the only way to do that is to call
submit_bio_wait() on the parent of the bio chain to wait for all
child bios to drop their references and call bio_endio() on the
parent of the chain....
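
i.e. the error path needs to look something like this (sketch):

	if (ret < 0) {
		/*
		 * Chained child bios that were already submitted still
		 * reference the buffer, so we can't just free it and
		 * return. Wait on the tail of the chain instead -
		 * submit_bio_wait() does not return until every
		 * chained child bio has completed.
		 */
		submit_bio_wait(bio);
		bio_put(bio);
		return ret;
	}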

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 23:29   ` Christoph Hellwig
  2019-08-22  0:37     ` Dave Chinner
@ 2019-08-22  2:50     ` Ming Lei
  2019-08-22  4:49       ` Dave Chinner
  1 sibling, 1 reply; 44+ messages in thread
From: Ming Lei @ 2019-08-22  2:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, open list:XFS FILESYSTEM, Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Add memory buffer alignment validation checks to bios built in XFS
> > to catch bugs that will result in silent data corruption in block
> > drivers that cannot handle unaligned memory buffers but don't
> > validate the incoming buffer alignment is correct.
> >
> > Known drivers with these issues are xenblk, brd and pmem.
> >
> > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > function was created to do the required validation because the block
> > layer developers that keep telling us that is not possible to
> > validate buffer alignment in bio_add_page(), and even if it was
> > possible it would be too much overhead to do at runtime.
>
> I really don't think we should lift this to XFS, but instead fix it
> in the block layer.  And that is not only because I have a pending
> series lifting bits you are touching to the block layer..
>
> > +int
> > +xfs_bio_add_page(
> > +     struct bio      *bio,
> > +     struct page     *page,
> > +     unsigned int    len,
> > +     unsigned int    offset)
> > +{
> > +     struct request_queue    *q = bio->bi_disk->queue;
> > +     bool            same_page = false;
> > +
> > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > +             return -EIO;
> > +
> > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > +             if (bio_full(bio, len))
> > +                     return 0;
> > +             __bio_add_page(bio, page, len, offset);
> > +     }
> > +     return len;
>
> I know Jens disagrees, but with the amount of bugs we've been hitting
> thanks to slub (and I'm pretty sure we have more hiding outside of
> XFS) I think we need to add the blk_rq_aligned check to bio_add_page.

It isn't correct to use blk_rq_aligned() here because 'len' has to be logical block
size aligned, instead of DMA aligned only.

Also I'm not sure all users set up bio->bi_disk before adding pages to a bio,
since it is allowed to do that now.

If a slub buffer crosses two pages, the block layer may not handle it
at all even though the un-aligned 'offset' issue is solved.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  2:50     ` Ming Lei
@ 2019-08-22  4:49       ` Dave Chinner
  2019-08-22  7:23         ` Ming Lei
  2019-08-22  8:08         ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-22  4:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, open list:XFS FILESYSTEM, Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Add memory buffer alignment validation checks to bios built in XFS
> > > to catch bugs that will result in silent data corruption in block
> > > drivers that cannot handle unaligned memory buffers but don't
> > > validate the incoming buffer alignment is correct.
> > >
> > > Known drivers with these issues are xenblk, brd and pmem.
> > >
> > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > function was created to do the required validation because the block
> > > layer developers that keep telling us that is not possible to
> > > validate buffer alignment in bio_add_page(), and even if it was
> > > possible it would be too much overhead to do at runtime.
> >
> > I really don't think we should lift this to XFS, but instead fix it
> > in the block layer.  And that is not only because I have a pending
> > series lifting bits you are touching to the block layer..
> >
> > > +int
> > > +xfs_bio_add_page(
> > > +     struct bio      *bio,
> > > +     struct page     *page,
> > > +     unsigned int    len,
> > > +     unsigned int    offset)
> > > +{
> > > +     struct request_queue    *q = bio->bi_disk->queue;
> > > +     bool            same_page = false;
> > > +
> > > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > +             return -EIO;
> > > +
> > > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > +             if (bio_full(bio, len))
> > > +                     return 0;
> > > +             __bio_add_page(bio, page, len, offset);
> > > +     }
> > > +     return len;
> >
> > I know Jens disagrees, but with the amount of bugs we've been hitting
> > thanks to slub (and I'm pretty sure we have more hiding outside of
> > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> 
> It isn't correct to use blk_rq_aligned() here because 'len' has to be logical block
> size aligned, instead of DMA aligned only.

News to me.

AFAIA, the overall _IO_ that is being built needs to be a multiple
of the logical block size in total size (i.e. bio->bi_iter.bi_size)
because sub sector IO is not allowed. But queue DMA limits are not
defined in sectors - they define the scatter/gather DMA capability
of the hardware, and that's what individual segments (bvecs) need to
align to.  That's what blk_rq_aligned() checks here - that the bvec
segment aligns to what the underlying driver(s) requires, not that
the entire IO is sector sized and aligned.
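
For reference, blk_rq_aligned() is just a mask check against the
queue's DMA constraints (quoting from memory):

static inline int blk_rq_aligned(struct request_queue *q,
		unsigned long addr, unsigned int len)
{
	unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;

	return !(addr & alignment) && !(len & alignment);
}

There's no logical block size in there anywhere - it's purely the
driver's DMA alignment requirements.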

Also, think about multipage bvecs - the pages we are spanning here
are contiguous pages, so this should end up merging them and turning
it into a single multipage bvec whose length is sector size
aligned...

> Also I'm not sure all users set up bio->bi_disk before adding pages to a bio,
> since it is allowed to do that now.

XFS does, so I just don't care about random users of bio_add_page()
in this patch. Somebody else can run the block layer gauntlet to get
these checks moved into generic code - they've already been
rejected twice as unnecessary.

> If a slub buffer crosses two pages, the block layer may not handle it
> at all even though the un-aligned 'offset' issue is solved.

A slub buffer crossing two _contiguous_ pages should end up merged
as a multipage bvec. But I'm curious, what does adding multiple
contiguous pages to a bio actually break?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  4:49       ` Dave Chinner
@ 2019-08-22  7:23         ` Ming Lei
  2019-08-22  8:08         ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Ming Lei @ 2019-08-22  7:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ming Lei, Christoph Hellwig, open list:XFS FILESYSTEM,
	Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > On Thu, Aug 22, 2019 at 8:06 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Add memory buffer alignment validation checks to bios built in XFS
> > > > to catch bugs that will result in silent data corruption in block
> > > > drivers that cannot handle unaligned memory buffers but don't
> > > > validate the incoming buffer alignment is correct.
> > > >
> > > > Known drivers with these issues are xenblk, brd and pmem.
> > > >
> > > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > > function was created to do the required validation because the block
> > > > layer developers that keep telling us that is not possible to
> > > > validate buffer alignment in bio_add_page(), and even if it was
> > > > possible it would be too much overhead to do at runtime.
> > >
> > > I really don't think we should lift this to XFS, but instead fix it
> > > in the block layer.  And that is not only because I have a pending
> > > series lifting bits you are touching to the block layer..
> > >
> > > > +int
> > > > +xfs_bio_add_page(
> > > > +     struct bio      *bio,
> > > > +     struct page     *page,
> > > > +     unsigned int    len,
> > > > +     unsigned int    offset)
> > > > +{
> > > > +     struct request_queue    *q = bio->bi_disk->queue;
> > > > +     bool            same_page = false;
> > > > +
> > > > +     if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > > +             return -EIO;
> > > > +
> > > > +     if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > > +             if (bio_full(bio, len))
> > > > +                     return 0;
> > > > +             __bio_add_page(bio, page, len, offset);
> > > > +     }
> > > > +     return len;
> > >
> > > I know Jens disagrees, but with the amount of bugs we've been hitting
> > > thanks to slub (and I'm pretty sure we have more hiding outside of
> > > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> > 
> > It isn't correct to use blk_rq_aligned() here because 'len' has to be logical block
> > size aligned, instead of DMA aligned only.
> 
> News to me.
> 
> AFAIA, the overall _IO_ that is being built needs to be a multiple
> of the logical block size in total size (i.e. bio->bi_iter.bi_size)

Right.

> because sub sector IO is not allowed. But queue DMA limits are not
> defined in sectors - they define the scatter/gather DMA capability
> of the hardware, and that's what individual segments (bvecs) need to
> align to.  That's what blk_rq_aligned() checks here - that the bvec

A segment isn't the same as a bvec. We build segments via the
scatterlist interface from bvecs when the driver needs segments for
DMA between the CPU and the HBA. The built segments have to respect
all kinds of queue limits.

Now there are two kinds of bio: the fs bio, and the bios for doing IO
from/to the device. The block layer splits the fs bio into bios of a
proper size for doing the IO.

If one bvec is added to the fs bio with an un-aligned length, and this
bvec can't be merged with the following ones, how can the block layer
handle that? For example, if this bvec is un-aligned with respect to
the virt boundary, then a single bio is allocated for doing the IO of
this bvec, and sub-sector IO is generated.

> segment aligns to what the underlying driver(s) requires, not that
> the entire IO is sector sized and aligned.

Not every driver needs to handle segments; some drivers simply handle
single-page bvecs (pmem, brd, zram, ...) or multi-page bvecs (loop).

Un-aligned bvecs may then cause trouble for the drivers which use
single-page bvecs.

> 
> Also, think about multipage bvecs - the pages we are spanning here
> are contiguous pages, so this should end up merging them and turning
> it into a single multipage bvec whose length is sector size
> aligned...

This works for drivers which use segments, and most drivers are of
that type.

> 
> > Also I'm not sure all users set up bio->bi_disk before adding pages to a bio,
> > since it is allowed to do that now.
> 
> XFS does, so I just don't care about random users of bio_add_page()
> in this patch. Somebody else can run the block layer gauntlet to get
> these checks moved into generic code and they've already been
> rejected twice as unnecessary.

If the check is to be added to bio_add_page(), every user has to be
audited.

> 
> > If a slub buffer crosses two pages, the block layer may not handle it
> > at all even though the un-aligned 'offset' issue is solved.
> 
> A slub buffer crossing two _contiguous_ pages should end up merged
> as a multipage bvec. But I'm curious, what does adding multiple
> contiguous pages to a bio actually break?

Some drivers don't or can't handle multi-page bvecs, so we have to
split them into single-page bvecs; un-aligned bvecs are then seen by
these drivers.

thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22  0:31     ` Dave Chinner
@ 2019-08-22  7:59       ` Christoph Hellwig
  2019-08-22  8:51         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-22  7:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Peter Zijlstra, Ingo Molnar, Will Deacon,
	linux-kernel, linux-mm

On Thu, Aug 22, 2019 at 10:31:32AM +1000, Dave Chinner wrote:
> > Btw, I think we should eventually kill off KM_NOFS and just use
> > PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> > But that's something for the future.
> 
> Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
> at high levels - we'll still need to annotate callers that use KM_NOFS
> to avoid lockdep false positives. i.e. any code that can be called from
> GFP_KERNEL and reclaim context will throw false positives from
> lockdep if we don't annotate tehm correctly....

Oh well.  For now we have the XFS kmem_wrappers to turn that into
GFP_NOFS so we shouldn't be too worried, but I think that is something
we should fix in lockdep to ensure it is generally useful.  I've added
the maintainers and relevant lists to kick off a discussion.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  0:37     ` Dave Chinner
@ 2019-08-22  8:03       ` Christoph Hellwig
  2019-08-22 10:17         ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-22  8:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 10:37:45AM +1000, Dave Chinner wrote:
> > I know Jens disagrees, but with the amount of bugs we've been hitting
> > thanks to slub (and I'm pretty sure we have more hiding outside of
> > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> 
> ... I'm not prepared to fight this battle to get this initial fix
> into the code. Get the fix merged, then we can 

Well, the initial fixes are the first two patches.  This patch really
just adds a safety belt.  I'll happily take over the effort to get
sensible checks in the block code if you give me a couple weeks,
in the meantime I'd prefer if we could skip this third patch for now.

> > Note that all current callers of bio_add_page can only really check
> > for the return value != the added len anyway, so it is not going to
> > make anything worse.
> 
> It does make things worse - it turns multi-bio chaining loops like
> the one in xfs_rw_bdev() into an endless loop as they don't make
> progress - they just keep allocating a new bio and retrying the same
> badly aligned buffer and failing. So if we want an alignment failure
> to error out, callers need to handle the failure, not treat it like
> a full bio.

True.  And we need to improve the interface here, which I'm going
to try to fit into my series that lifts some common bio filling code
to the core.  I'll add you to the cc list.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  4:49       ` Dave Chinner
  2019-08-22  7:23         ` Ming Lei
@ 2019-08-22  8:08         ` Christoph Hellwig
  2019-08-22 10:20           ` Ming Lei
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-22  8:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ming Lei, Christoph Hellwig, open list:XFS FILESYSTEM,
	Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > It isn't correct to use blk_rq_aligned() here because 'len' has to be logical block
> > size aligned, instead of DMA aligned only.

Even if len would have to be a multiple of the sector size, that doesn't
mean calling blk_rq_aligned would be incorrect, just that it possibly
won't catch all issues.

But as Dave outlined I don't think it is a problem in any way.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22  7:59       ` Christoph Hellwig
@ 2019-08-22  8:51         ` Peter Zijlstra
  2019-08-22  9:10           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2019-08-22  8:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-xfs, Ingo Molnar, Will Deacon, linux-kernel,
	linux-mm, penguin-kernel

On Thu, Aug 22, 2019 at 12:59:48AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 10:31:32AM +1000, Dave Chinner wrote:
> > > Btw, I think we should eventually kill off KM_NOFS and just use
> > > PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> > > But that's something for the future.
> > 
> > Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
> > at high levels - we'll still need to annotate callers that use KM_NOFS
> > to avoid lockdep false positives. i.e. any code that can be called from
> > GFP_KERNEL and reclaim context will throw false positives from
> lockdep if we don't annotate them correctly....
> 
> Oh well.  For now we have the XFS kmem_wrappers to turn that into
> GFP_NOFS so we shouldn't be too worried, but I think that is something
> we should fix in lockdep to ensure it is generally useful.  I've added
> the maintainers and relevant lists to kick off a discussion.

Strictly speaking the fs_reclaim annotation is no longer part of the
lockdep core, but is simply a fake lock in page_alloc.c and thus falls
under the mm people's purview.

That said; it should be fairly straightforward to teach
__need_fs_reclaim() about PF_MEMALLOC_NOFS, much like how it already
knows about PF_MEMALLOC.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22  8:51         ` Peter Zijlstra
@ 2019-08-22  9:10           ` Peter Zijlstra
  2019-08-22 10:14             ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2019-08-22  9:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-xfs, Ingo Molnar, Will Deacon, linux-kernel,
	linux-mm, penguin-kernel

On Thu, Aug 22, 2019 at 10:51:30AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 22, 2019 at 12:59:48AM -0700, Christoph Hellwig wrote:
> > On Thu, Aug 22, 2019 at 10:31:32AM +1000, Dave Chinner wrote:
> > > > Btw, I think we should eventually kill off KM_NOFS and just use
> > > > PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> > > > But that's something for the future.
> > > 
> > > Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
> > > at high levels - we'll still need to annotate callers that use KM_NOFS
> > > to avoid lockdep false positives. i.e. any code that can be called from
> > > GFP_KERNEL and reclaim context will throw false positives from
> > > lockdep if we don't annotate them correctly....
> > 
> > Oh well.  For now we have the XFS kmem_wrappers to turn that into
> > GFP_NOFS so we shouldn't be too worried, but I think that is something
> > we should fix in lockdep to ensure it is generally useful.  I've added
> > the maintainers and relevant lists to kick off a discussion.
> 
> Strictly speaking the fs_reclaim annotation is no longer part of the
> lockdep core, but is simply a fake lock in page_alloc.c and thus falls
> under the mm people's purview.
> 
> That said; it should be fairly straightforward to teach
> __need_fs_reclaim() about PF_MEMALLOC_NOFS, much like how it already
> knows about PF_MEMALLOC.

Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
into the GFP flags.
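
i.e. something like this (from memory):

static inline gfp_t current_gfp_context(gfp_t flags)
{
	if (unlikely(current->flags &
		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
		/* NOIO implies NOFS as well, so it takes precedence */
		if (current->flags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (current->flags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
	}
	return flags;
}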

So are we sure it is broken and needs mending?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22  9:10           ` Peter Zijlstra
@ 2019-08-22 10:14             ` Dave Chinner
  2019-08-22 11:14               ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 10:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, linux-xfs, Ingo Molnar, Will Deacon,
	linux-kernel, linux-mm, penguin-kernel

On Thu, Aug 22, 2019 at 11:10:57AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 22, 2019 at 10:51:30AM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 22, 2019 at 12:59:48AM -0700, Christoph Hellwig wrote:
> > > On Thu, Aug 22, 2019 at 10:31:32AM +1000, Dave Chinner wrote:
> > > > > Btw, I think we should eventually kill off KM_NOFS and just use
> > > > > PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> > > > > But that's something for the future.
> > > > 
> > > > Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
> > > > at high levels - we'll still need to annotate callers that use KM_NOFS
> > > > to avoid lockdep false positives. i.e. any code that can be called from
> > > > GFP_KERNEL and reclaim context will throw false positives from
> > > > lockdep if we don't annotate them correctly....
> > > 
> > > Oh well.  For now we have the XFS kmem_wrappers to turn that into
> > > GFP_NOFS so we shouldn't be too worried, but I think that is something
> > > we should fix in lockdep to ensure it is generally useful.  I've added
> > > the maintainers and relevant lists to kick off a discussion.
> > 
> > Strictly speaking the fs_reclaim annotation is no longer part of the
> > lockdep core, but is simply a fake lock in page_alloc.c and thus falls
> > under the mm people's purview.
> > 
> > That said; it should be fairly straightforward to teach
> > __need_fs_reclaim() about PF_MEMALLOC_NOFS, much like how it already
> > knows about PF_MEMALLOC.
> 
> Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
> into the GFP flags.
> 
> So are we sure it is broken and needs mending?

Well, that's what we are trying to work out. The problem is that we
have code that takes locks and does allocations that is called both
above and below the reclaim "lock" context. Once it's been seen
below the reclaim lock context, calling it with GFP_KERNEL context
above the reclaim lock context throws a deadlock warning.

The only way around that was to mark these allocation sites as
GFP_NOFS so lockdep is never allowed to see that recursion through
reclaim occur. Even though it isn't a deadlock vector.

What we're looking at is whether PF_MEMALLOC_NOFS changes this - I
don't think it does solve this problem. i.e. if we define the
allocation as GFP_KERNEL and then use PF_MEMALLOC_NOFS where reclaim
is not allowed, we still have GFP_KERNEL allocations in code above
reclaim that has also been seen below reclaim. And so we'll get
false positive warnings again.

What I think we are going to have to do here is manually audit
each of the KM_NOFS call sites as we remove the NOFS from them and
determine if ___GFP_NOLOCKDEP is needed to stop lockdep from trying
to track these allocation sites. We've never used this tag because
we'd already fixed most of these false positives with explicit
GFP_NOFS tags long before ___GFP_NOLOCKDEP was created.
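
i.e. a hypothetical conversion of one such site would simply be:

	/* GFP_KERNEL semantics, but hidden from lockdep's reclaim
	 * tracking via __GFP_NOLOCKDEP */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);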

But until someone starts doing the work, I don't know if it will
work or even whether converting to PF_MEMALLOC_NOFS is going to
introduce a bunch of new ways to get false positives from lockdep...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  8:03       ` Christoph Hellwig
@ 2019-08-22 10:17         ` Dave Chinner
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 10:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 01:03:12AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 10:37:45AM +1000, Dave Chinner wrote:
> > > I know Jens disagrees, but with the amount of bugs we've been hitting
> > > thanks to slub (and I'm pretty sure we have more hiding outside of
> > > XFS) I think we need to add the blk_rq_aligned check to bio_add_page.
> > 
> > ... I'm not prepared to fight this battle to get this initial fix
> > into the code. Get the fix merged, then we can 
> 
> Well, the initial fixes are the first two patches.  This patch really
> just adds a safety belt.  I'll happily take over the effort to get
> sensible checks in the block code if you give me a couple weeks,
> in the meantime I'd prefer if we could skip this third patch for now.

Fine by me. I'll just repost the current versions of the first two
patches (now three) in the morning.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22  8:08         ` Christoph Hellwig
@ 2019-08-22 10:20           ` Ming Lei
  2019-08-23  0:14             ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Ming Lei @ 2019-08-22 10:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ming Lei, open list:XFS FILESYSTEM, Jens Axboe,
	linux-block

On Thu, Aug 22, 2019 at 01:08:52AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 02:49:05PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 10:50:02AM +0800, Ming Lei wrote:
> > > It isn't correct to use blk_rq_aligned() here because 'len' has to be logical block
> > > size aligned, instead of DMA aligned only.
> 
> Even if len would have to be a multiple of the sector size, that doesn't
> mean calling blk_rq_aligned would be incorrect, just that it possibly
> won't catch all issues.

In theory, the fs bio shouldn't care about any DMA limits; those
should be applied to the split bios used for doing IO to the device.

Also, .dma_alignment isn't considered in blk_stack_limits(), so in the
case of DM, MD or other stacking drivers, the fs code won't know the
accurate .dma_alignment of the underlying queues at all, and the
stacking driver's queue dma alignment is still 512.

Also, supposing the check is added, I am a bit curious how the fs code
handles the failure, so could you explain a bit about the failure
handling?

Thanks, 
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 10:14             ` Dave Chinner
@ 2019-08-22 11:14               ` Vlastimil Babka
  2019-08-22 12:07                 ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2019-08-22 11:14 UTC (permalink / raw)
  To: Dave Chinner, Peter Zijlstra
  Cc: Christoph Hellwig, linux-xfs, Ingo Molnar, Will Deacon,
	linux-kernel, linux-mm, penguin-kernel

On 8/22/19 12:14 PM, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 11:10:57AM +0200, Peter Zijlstra wrote:
>> 
>> Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
>> into the GFP flags.
>> 
>> So are we sure it is broken and needs mending?
> 
> Well, that's what we are trying to work out. The problem is that we
> have code that takes locks and does allocations that is called both
> above and below the reclaim "lock" context. Once it's been seen
> below the reclaim lock context, calling it with GFP_KERNEL context
> above the reclaim lock context throws a deadlock warning.
> 
> The only way around that was to mark these allocation sites as
> GFP_NOFS so lockdep is never allowed to see that recursion through
> reclaim occur. Even though it isn't a deadlock vector.
> 
> What we're looking at is whether PF_MEMALLOC_NOFS changes this - I
> don't think it does solve this problem. i.e. if we define the
> allocation as GFP_KERNEL and then use PF_MEMALLOC_NOFS where reclaim
> is not allowed, we still have GFP_KERNEL allocations in code above
> > reclaim that has also been seen below reclaim. And so we'll get
> false positive warnings again.

If I understand both you and the code correctly, the code sites won't call
__fs_reclaim_acquire when called with current->flags including PF_MEMALLOC_NOFS.
So that would mean they "won't be seen below the reclaim" and all would be fine,
right?

> What I think we are going to have to do here is manually audit
> each of the KM_NOFS call sites as we remove the NOFS from them and
> determine if ___GFP_NOLOCKDEP is needed to stop lockdep from trying
> to track these allocation sites. We've never used this tag because
> we'd already fixed most of these false positives with explicit
> GFP_NOFS tags long before ___GFP_NOLOCKDEP was created.
> 
> But until someone starts doing the work, I don't know if it will
> > work or even whether converting to PF_MEMALLOC_NOFS is going to
> introduce a bunch of new ways to get false positives from lockdep...
> 
> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 11:14               ` Vlastimil Babka
@ 2019-08-22 12:07                 ` Dave Chinner
  2019-08-22 12:19                   ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 12:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christoph Hellwig, linux-xfs, Ingo Molnar,
	Will Deacon, linux-kernel, linux-mm, penguin-kernel

On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> On 8/22/19 12:14 PM, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 11:10:57AM +0200, Peter Zijlstra wrote:
> >> 
> >> Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
> >> into the GFP flags.
> >> 
> >> So are we sure it is broken and needs mending?
> > 
> > Well, that's what we are trying to work out. The problem is that we
> > have code that takes locks and does allocations that is called both
> > above and below the reclaim "lock" context. Once it's been seen
> > below the reclaim lock context, calling it with GFP_KERNEL context
> > above the reclaim lock context throws a deadlock warning.
> > 
> > The only way around that was to mark these allocation sites as
> > GFP_NOFS so lockdep is never allowed to see that recursion through
> > reclaim occur. Even though it isn't a deadlock vector.
> > 
> > What we're looking at is whether PF_MEMALLOC_NOFS changes this - I
> > don't think it does solve this problem. i.e. if we define the
> > allocation as GFP_KERNEL and then use PF_MEMALLOC_NOFS where reclaim
> > is not allowed, we still have GFP_KERNEL allocations in code above
> > > reclaim that has also been seen below reclaim. And so we'll get
> > false positive warnings again.
> 
> If I understand both you and the code correctly, the code sites won't call
> __fs_reclaim_acquire when called with current->flags including PF_MEMALLOC_NOFS.
> So that would mean they "won't be seen below the reclaim" and all would be fine,
> right?

No, the problem is this (using kmalloc as a general term for
allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)

   some random kernel code
    kmalloc(GFP_KERNEL)
     reclaim
     PF_MEMALLOC
     shrink_slab
      xfs_inode_shrink
       XFS_ILOCK
        xfs_buf_allocate_memory()
         kmalloc(GFP_KERNEL)

And so locks on inodes in reclaim are seen below reclaim. Then
somewhere else we have:

   some high level read-only xfs code like readdir
    XFS_ILOCK
     xfs_buf_allocate_memory()
      kmalloc(GFP_KERNEL)
       reclaim

And this one throws false positive lockdep warnings because we
called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
context. So the only solution we had at the time to shut it up was:

   some high level read-only xfs code like readdir
    XFS_ILOCK
     xfs_buf_allocate_memory()
      kmalloc(GFP_NOFS)

So that lockdep sees it's not going to recurse into reclaim and
doesn't throw a warning...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 12:07                 ` Dave Chinner
@ 2019-08-22 12:19                   ` Vlastimil Babka
  2019-08-22 13:17                     ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2019-08-22 12:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Christoph Hellwig, linux-xfs, Ingo Molnar,
	Will Deacon, linux-kernel, linux-mm, penguin-kernel

On 8/22/19 2:07 PM, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> 
> No, the problem is this (using kmalloc as a general term for
> allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)
> 
>    some random kernel code
>     kmalloc(GFP_KERNEL)
>      reclaim
>      PF_MEMALLOC
>      shrink_slab
>       xfs_inode_shrink
>        XFS_ILOCK
>         xfs_buf_allocate_memory()
>          kmalloc(GFP_KERNEL)
> 
> And so locks on inodes in reclaim are seen below reclaim. Then
> somewhere else we have:
> 
>    some high level read-only xfs code like readdir
>     XFS_ILOCK
>      xfs_buf_allocate_memory()
>       kmalloc(GFP_KERNEL)
>        reclaim
> 
> And this one throws false positive lockdep warnings because we
> called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc

OK, and what exactly makes this positive a false one? Why can't it continue like
the first example where reclaim leads to another XFS_ILOCK, thus deadlock?

> context. So the only solution we had at the time to shut it up was:
> 
>    some high level read-only xfs code like readdir
>     XFS_ILOCK
>      xfs_buf_allocate_memory()
>       kmalloc(GFP_NOFS)
> 
> So that lockdep sees it's not going to recurse into reclaim and
> doesn't throw a warning...

AFAICS that GFP_NOFS would fix not only a warning but also a real deadlock
(depending on the answer to my previous question).

> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 12:19                   ` Vlastimil Babka
@ 2019-08-22 13:17                     ` Dave Chinner
  2019-08-22 14:26                       ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 13:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Zijlstra, Christoph Hellwig, linux-xfs, Ingo Molnar,
	Will Deacon, linux-kernel, linux-mm, penguin-kernel

On Thu, Aug 22, 2019 at 02:19:04PM +0200, Vlastimil Babka wrote:
> On 8/22/19 2:07 PM, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> > 
> > No, the problem is this (using kmalloc as a general term for
> > allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)
> > 
> >    some random kernel code
> >     kmalloc(GFP_KERNEL)
> >      reclaim
> >      PF_MEMALLOC
> >      shrink_slab
> >       xfs_inode_shrink
> >        XFS_ILOCK
> >         xfs_buf_allocate_memory()
> >          kmalloc(GFP_KERNEL)
> > 
> > And so locks on inodes in reclaim are seen below reclaim. Then
> > somewhere else we have:
> > 
> >    some high level read-only xfs code like readdir
> >     XFS_ILOCK
> >      xfs_buf_allocate_memory()
> >       kmalloc(GFP_KERNEL)
> >        reclaim
> > 
> > And this one throws false positive lockdep warnings because we
> > called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
> 
> OK, and what exactly makes this positive a false one? Why can't it continue like
> the first example where reclaim leads to another XFS_ILOCK, thus deadlock?

Because above reclaim we only have operations being done on
referenced inodes, and below reclaim we only have unreferenced
inodes. We never lock the same inode both above and below reclaim
at the same time.

IOWs, an operation above reclaim cannot see, access or lock
unreferenced inodes, except in inode write clustering, and that uses
trylocks so cannot deadlock with reclaim.

An operation below reclaim cannot see, access or lock referenced
inodes except during inode write clustering, and that uses trylocks
so cannot deadlock with code above reclaim.

FWIW, I'm trying to make the inode writeback clustering go away from
reclaim at the moment, so even that possibility is going away soon.
That will change everything to trylocks in reclaim context, so
lockdep is going to stop tracking it entirely.
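
The reclaim side then ends up looking something like this (sketch
only, using the existing xfs_ilock_nowait() interface):

	/*
	 * Sketch: reclaim backs off instead of blocking on the inode
	 * lock, so there is no lock-wait dependency left for lockdep
	 * to connect to the allocation context.
	 */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return false;	/* skip inode, revisit on a later pass */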

Hmmm - maybe we're getting to the point where we actually
don't need GFP_NOFS/PF_MEMALLOC_NOFS at all in XFS anymore.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-21 21:14     ` Dave Chinner
@ 2019-08-22 13:40       ` Brian Foster
  2019-08-22 22:39         ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2019-08-22 13:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Aug 22, 2019 at 07:14:52AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Memory we use to submit for IO needs strict alignment to the
> > > underlying driver constraints. Worst case, this is 512 bytes. Given
> > > that all allocations for IO are always a power of 2 multiple of 512
> > > bytes, the kernel heap provides natural alignment for objects of
> > > these sizes and that suffices.
> > > 
> > > Until, of course, memory debugging of some kind is turned on (e.g.
> > > red zones, poisoning, KASAN) and then the alignment of the heap
> > > objects is thrown out the window. Then we get weird IO errors and
> > > data corruption problems because drivers don't validate alignment
> > > and do the wrong thing when passed unaligned memory buffers in bios.
> > > 
> > > TO fix this, introduce kmem_alloc_io(), which will guarantee at least
> > 
> > s/TO/To/
> > 
> > > 512 byte alignment of buffers for IO, even if memory debugging
> > > options are turned on. It is assumed that the minimum allocation
> > > size will be 512 bytes, and that sizes will be power of 2 multiples
> > > of 512 bytes.
> > > 
> > > Use this everywhere we allocate buffers for IO.
> > > 
> > > This no longer fails with log recovery errors when KASAN is enabled
> > > due to the brd driver not handling unaligned memory buffers:
> > > 
> > > # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/kmem.c            | 61 +++++++++++++++++++++++++++++-----------
> > >  fs/xfs/kmem.h            |  1 +
> > >  fs/xfs/xfs_buf.c         |  4 +--
> > >  fs/xfs/xfs_log.c         |  2 +-
> > >  fs/xfs/xfs_log_recover.c |  2 +-
> > >  fs/xfs/xfs_trace.h       |  1 +
> > >  6 files changed, 50 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index edcf393c8fd9..ec693c0fdcff 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > ...
> > > @@ -62,6 +56,39 @@ kmem_alloc_large(size_t size, xfs_km_flags_t flags)
> > >  	return ptr;
> > >  }
> > >  
> > > +/*
> > > + * Same as kmem_alloc_large, except we guarantee a 512 byte aligned buffer is
> > > + * returned. vmalloc always returns an aligned region.
> > > + */
> > > +void *
> > > +kmem_alloc_io(size_t size, xfs_km_flags_t flags)
> > > +{
> > > +	void	*ptr;
> > > +
> > > +	trace_kmem_alloc_io(size, flags, _RET_IP_);
> > > +
> > > +	ptr = kmem_alloc(size, flags | KM_MAYFAIL);
> > > +	if (ptr) {
> > > +		if (!((long)ptr & 511))
> > > +			return ptr;
> > > +		kfree(ptr);
> > > +	}
> > > +	return __kmem_vmalloc(size, flags);
> > > +}
> > 
> > Even though it is unfortunate, this seems like a quite reasonable and
> > isolated temporary solution to the problem to me. The one concern I have
> > is if/how much this could affect performance under certain
> > circumstances.
> 
> Can't measure a difference on 4k block size filesystems. It's only
> used for log recovery and then for allocating AGF/AGI buffers on
> 512 byte sector devices. Anything using 4k sectors only hits it
> during mount. So for default configs with memory poisoning/KASAN
> enabled, the massive overhead of poisoning/tracking makes this
> disappear in the noise.
> 
> For 1k block size filesystems, it gets hit much harder, but
> there's no noticeable increase in runtime of xfstests vs 4k block
> size with KASAN enabled. The big increase in overhead comes from
> enabling KASAN (takes 3x longer than without), not doing one extra
> allocation/free pair.
> 
> > I realize that these callsites are isolated in the common
> > scenario. Less common scenarios like sub-page block sizes (whether due
> > to explicit mkfs time format or default configurations on larger page
> > size systems) can fall into this path much more frequently, however.
> 
> *nod*
> 
> > Since this implies some kind of vm debug option is enabled, performance
> > itself isn't critical when this solution is active. But how bad is it in
> > those cases where we might depend on this more heavily? Have you
> > confirmed that the end configuration is still "usable," at least?
> 
> No noticeable difference, most definitely still usable.
> 

Ok, thanks.

> > I ask because the repeated alloc/free behavior can easily be avoided via
> > something like an mp flag (which may require a tweak to the
> 
> What's an "mp flag"?
> 

A bool or something similar added to xfs_mount to control whether further
slab allocation attempts are made for I/O buffers on this mount. I was just
throwing this out there if the performance hit happened to be bad (on
top of whatever vm debug option is enabled) in those configurations
where slab based buffer allocations are more common. If the performance
hit is negligible in practice, then I'm not worried about it.
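
As a rough sketch of what I mean (untested; the m_unaligned_io field
and the xfs_mount parameter are hypothetical additions):

	void *
	kmem_alloc_io(struct xfs_mount *mp, size_t size, xfs_km_flags_t flags)
	{
		void	*ptr;

		/* hypothetical: stop retrying slab once it misaligns */
		if (!mp->m_unaligned_io) {
			ptr = kmem_alloc(size, flags | KM_MAYFAIL);
			if (ptr) {
				if (!((long)ptr & 511))
					return ptr;
				mp->m_unaligned_io = true;
				kfree(ptr);
			}
		}
		return __kmem_vmalloc(size, flags);
	}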

> > kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> > path once we see one unaligned allocation. That assumes this behavior is
> > tied to functionality that isn't dynamically configured at runtime, of
> > course.
> 
> vmalloc() has a _lot_ more overhead than kmalloc (especially when
> vmalloc has to do multiple allocations itself to set up page table
> entries) so there is still an overall gain to be had by trying
> kmalloc even if 50% of allocations are unaligned.
> 

I had the impression that this unaligned allocation behavior is tied to
enablement of debug options that otherwise aren't enabled/disabled
dynamically. Hence, the unaligned allocation behavior is persistent for
a particular mount and repeated attempts are pointless once we see at
least one such result. Is that not the case?

Again, I don't think performance is a huge deal so long as testing shows
that an fs is still usable with XFS running this kind of allocation
pattern. In thinking further about it, aren't we essentially bypassing
these tools for associated allocations if they don't offer similar
functionality for vmalloc allocations? It might be worth 1.) noting that
as a consequence of this change in the commit log and 2.) having a
oneshot warning somewhere when we initially hit this problem so somebody
actually using one of these tools realizes that enabling it actually
changes allocation behavior. For example:

XFS ...: WARNING: Unaligned I/O memory allocation. VM debug enabled?
Disabling slab allocations for I/O.

... or alternatively just add a WARN_ON_ONCE() or something with a
similar comment in the code.
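
In code, that could be as little as (sketch):

	if (ptr) {
		if (!((long)ptr & 511))
			return ptr;
		/* oneshot: vm debugging has changed heap alignment */
		WARN_ON_ONCE(1);
		kfree(ptr);
	}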

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-21 21:39     ` Dave Chinner
@ 2019-08-22 13:47       ` Brian Foster
  2019-08-22 23:03         ` Dave Chinner
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2019-08-22 13:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Add memory buffer alignment validation checks to bios built in XFS
> > > to catch bugs that will result in silent data corruption in block
> > > drivers that cannot handle unaligned memory buffers but don't
> > > validate the incoming buffer alignment is correct.
> > > 
> > > Known drivers with these issues are xenblk, brd and pmem.
> > > 
> > > Despite there being nothing XFS specific to xfs_bio_add_page(), this
> > > function was created to do the required validation because the block
> > > layer developers keep telling us that it is not possible to
> > > validate buffer alignment in bio_add_page(), and even if it was
> > > possible it would be too much overhead to do at runtime.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
> > >  fs/xfs/xfs_buf.c    |  2 +-
> > >  fs/xfs/xfs_linux.h  |  3 +++
> > >  fs/xfs/xfs_log.c    |  6 +++++-
> > >  4 files changed, 38 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> > > index e2148f2d5d6b..fbaea643c000 100644
> > > --- a/fs/xfs/xfs_bio_io.c
> > > +++ b/fs/xfs/xfs_bio_io.c
> > > @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
> > >  	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
> > >  }
> > >  
> > > +int
> > > +xfs_bio_add_page(
> > > +	struct bio	*bio,
> > > +	struct page	*page,
> > > +	unsigned int	len,
> > > +	unsigned int	offset)
> > > +{
> > > +	struct request_queue	*q = bio->bi_disk->queue;
> > > +	bool		same_page = false;
> > > +
> > > +	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
> > > +		return -EIO;
> > > +
> > > +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> > > +		if (bio_full(bio, len))
> > > +			return 0;
> > > +		__bio_add_page(bio, page, len, offset);
> > > +	}
> > > +	return len;
> > > +}
> > > +
> > 
> > Seems reasonable to me. Looks like bio_add_page() with an error check.
> > ;)
> 
> It is exactly that.
> 
> > Given that, what's the need to open-code bio_add_page() here rather
> > than just call it after the check?
> 
> Largely because I wasn't sure exactly where the best place was to
> add the check. Copy first, hack to pieces, make work, then worry
> about other stuff :)
> 
> I could change it to calling bio_add_page() directly. Not
> really that fussed as it's all exported functionality...
> 
> 
> > >  int
> > >  xfs_rw_bdev(
> > >  	struct block_device	*bdev,
> > ...
> > > @@ -36,9 +57,12 @@ xfs_rw_bdev(
> > >  		unsigned int	off = offset_in_page(data);
> > >  		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
> > >  
> > > -		while (bio_add_page(bio, page, len, off) != len) {
> > > +		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
> > >  			struct bio	*prev = bio;
> > >  
> > > +			if (ret < 0)
> > > +				goto submit;
> > > +
> > 
> > Hmm.. is submitting the bio really the right thing to do if we get here
> > and have failed to add any pages to the bio?
> 
> Yes, because we really should wait for chained bios to complete
> > before returning. We can only do that by waiting on completion for
> the entire chain, and the simplest way to do that is submit the
> bio...
> 

Ok. I guess that makes sense here because these are sync I/Os and this
appears to be the only way to wait on a partially constructed chain
(that I can tell).

> > If we're already seeing
> > weird behavior for bios with unaligned data memory, this seems like a
> > recipe for similar weirdness. We'd also end up doing a partial write in
> > scenarios where we already know we're returning an error.
> 
> > The partial write will happen anyway on a chained bio, something we
> know already happens from the bug in bio chaining that was found in
> this code earlier in the cycle.
> 

Sure, a failure in a bio chain is essentially a partial write similar to
if one such bio in the chain had failed. What about the non-chaining
case? What happens on submission of an empty read/write bio? Does
behavior depend on the underlying storage stack?

Note that I see this patch is likely going away. I'm just saying I think
we should have some confirmation on how the block layer behaves in these
situations before we take an approach that relies on it. Is an empty bio
essentially a no-op that serves as a serialization mechanism? Does the
block layer return an error? Etc.

> > Perhaps we
> > should create an error path or use a check similar to what is already in
> > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > when we already know we're going to return an error) to call bio_endio()
> > to undo any chaining.
> 
> xfs_buf_ioapply_map() only fails with an error if the failure occurs on
> the first bio_add_page() call to a bio. If it fails on the second, then
> it submits the bio, allocates a new bio, and tries again. If that
> fails, then it frees the bio and does error processing.
> 
> That code can get away with such behaviour because it is not
> chaining bios. Each bio is its own IO, and failures there already
> result in partial IO completion with an error on the buffer. This
> code here will fail with a partial IO completion and return an error.
> 

Right, that is the behavior I was referring to above.

> So, in effect, this code provides exactly the same behaviour and
> result as the xfs_buf_ioapply_map() code does...
> 

In this case, we return an error that the caller will handle and
shutdown the fs. That's reasonable behavior here since these are sync
I/Os.

> > 
> > >  			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
> > >  			bio_copy_dev(bio, prev);
> > >  			bio->bi_iter.bi_sector = bio_end_sector(prev);
> > ...
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 7bd1f31febfc..a2d499baee9c 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > >  		if (nbytes > size)
> > >  			nbytes = size;
> > >  
> > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > >  				      offset);
> > 
> > Similar concern here. The higher level code seems written under the
> > assumption that bio_add_page() returns success or zero. In this case the
> > error is basically tossed so we can also attempt to split/chain an empty
> > bio, or even better, submit a partial write and continue operating as if
> > nothing happened (outside of the warning). The latter case should be
> > handled as a log I/O error one way or another.
> 
> See above - it does result in a failure when offset/nbytes is not
> aligned to the underlying device...
> 

Oops, this comment was misplaced. I meant to write this in response to
the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
the note about handling log I/O errors above).

The former has no such error checks that I can see. AFAICT, if we
construct a partial or empty bio here, we'd warn and return.
xlog_map_iclog_data() is a void function so the caller has no indication
anything went wrong and submits the bio. If an empty (?) or partial bio
doesn't return an error on bio completion, we've completely failed to
account for the fact that we've written a partial iclog to disk.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 13:17                     ` Dave Chinner
@ 2019-08-22 14:26                       ` Vlastimil Babka
  2019-08-26 12:21                         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2019-08-22 14:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Christoph Hellwig, linux-xfs, Ingo Molnar,
	Will Deacon, linux-kernel, linux-mm, penguin-kernel

On 8/22/19 3:17 PM, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 02:19:04PM +0200, Vlastimil Babka wrote:
>> On 8/22/19 2:07 PM, Dave Chinner wrote:
>> > On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
>> > 
>> > No, the problem is this (using kmalloc as a general term for
>> > allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)
>> > 
>> >    some random kernel code
>> >     kmalloc(GFP_KERNEL)
>> >      reclaim
>> >      PF_MEMALLOC
>> >      shrink_slab
>> >       xfs_inode_shrink
>> >        XFS_ILOCK
>> >         xfs_buf_allocate_memory()
>> >          kmalloc(GFP_KERNEL)
>> > 
>> > And so locks on inodes in reclaim are seen below reclaim. Then
>> > somewhere else we have:
>> > 
>> >    some high level read-only xfs code like readdir
>> >     XFS_ILOCK
>> >      xfs_buf_allocate_memory()
>> >       kmalloc(GFP_KERNEL)
>> >        reclaim
>> > 
>> > And this one throws false positive lockdep warnings because we
>> > called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
>> 
>> OK, and what exactly makes this positive a false one? Why can't it continue like
>> the first example where reclaim leads to another XFS_ILOCK, thus deadlock?
> 
> Because above reclaim we only have operations being done on
> referenced inodes, and below reclaim we only have unreferenced
> inodes. We never lock the same inode both above and below reclaim
> at the same time.
> 
> IOWs, an operation above reclaim cannot see, access or lock
> unreferenced inodes, except in inode write clustering, and that uses
> trylocks so cannot deadlock with reclaim.
> 
> An operation below reclaim cannot see, access or lock referenced
> inodes except during inode write clustering, and that uses trylocks
> so cannot deadlock with code above reclaim.

Thanks for elaborating. Perhaps lockdep experts (not me) would know how to
express that. If not possible, then replacing GFP_NOFS with __GFP_NOLOCKDEP
should indeed suppress the warning, while allowing FS reclaim.

> FWIW, I'm trying to make the inode writeback clustering go away from
> reclaim at the moment, so even that possibility is going away soon.
> That will change everything to trylocks in reclaim context, so
> lockdep is going to stop tracking it entirely.

That's also a nice solution :)

> Hmmm - maybe we're getting to the point where we actually
> don't need GFP_NOFS/PF_MEMALLOC_NOFS at all in XFS anymore.....
> 
> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 13:40       ` Brian Foster
@ 2019-08-22 22:39         ` Dave Chinner
  2019-08-23 12:10           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 22:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 22, 2019 at 09:40:17AM -0400, Brian Foster wrote:
> On Thu, Aug 22, 2019 at 07:14:52AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> > > path once we see one unaligned allocation. That assumes this behavior is
> > > tied to functionality that isn't dynamically configured at runtime, of
> > > course.
> > 
> > vmalloc() has a _lot_ more overhead than kmalloc (especially when
> > vmalloc has to do multiple allocations itself to set up page table
> > entries) so there is still an overall gain to be had by trying
> > kmalloc even if 50% of allocations are unaligned.
> > 
> 
> I had the impression that this unaligned allocation behavior is tied to
> enablement of debug options that otherwise aren't enabled/disabled
> dynamically. Hence, the unaligned allocation behavior is persistent for
> a particular mount and repeated attempts are pointless once we see at
> least one such result. Is that not the case?

The alignment for a given slab is likely to be persistent, but it
will be different for different sized slabs. e.g. I saw 128 byte
offsets for 512 byte slabs, and 1024 byte offsets for 4k slabs. The
1024 byte offsets worked just fine (because they are multiples of
512 bytes!) but the 128 byte ones didn't.

Hence it's not a black and white "everything is unaligned and
unsupportable" situation, nor is the alignment necessarily an issue
for the underlying driver. e.g. most scsi and nvme hardware handles
8 byte alignment of buffers, and if the buffer is not aligned they
bounce it (detected via the same blk_rq_aligned() check I added)
and can still do the IO anyway. So a large amount of the IO stack
just doesn't care about user buffers being unaligned....
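
For reference, the test in question is roughly this (sketch using
the existing blk_rq_aligned() helper):

	unsigned long addr = (unsigned long)ptr;

	/*
	 * Both the buffer address and the IO length must match the
	 * queue's dma alignment mask. Drivers that care bounce the
	 * buffer when this fails (scsi, nvme); brd and pmem don't
	 * check at all and silently corrupt.
	 */
	if (!blk_rq_aligned(q, addr, len))
		return -EINVAL;	/* or bounce the buffer */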

> Again, I don't think performance is a huge deal so long as testing shows
> that an fs is still usable with XFS running this kind of allocation
> pattern.

It's added 10 minutes to the runtime of a full auto run with KASAN
enabled on pmem. To put that in context, the entire run took:

real    461m36.831s
user    44m31.779s
sys     708m37.467s

More than 7.5 hours to complete, so ten minutes here or there is
noise.

> In thinking further about it, aren't we essentially bypassing
> these tools for associated allocations if they don't offer similar
> functionality for vmalloc allocations?

kasan uses guard pages around vmalloc allocations to detect
out-of-bounds accesses. It still tracks the page allocations, etc, so we
still get use after free tracking, etc. i.e. AFAICT we don't
actually lose any debugging functionality by using vmalloc here.

> It might be worth 1.) noting that
> as a consequence of this change in the commit log and 2.) having a
> oneshot warning somewhere when we initially hit this problem so somebody
> actually using one of these tools realizes that enabling it actually
> changes allocation behavior. For example:
> 
> XFS ...: WARNING: Unaligned I/O memory allocation. VM debug enabled?
> Disabling slab allocations for I/O.
> 
> ... or alternatively just add a WARN_ON_ONCE() or something with a
> similar comment in the code.

Well, the WARN_ON_ONCE is in xfs_bio_add_page() when it detects an
invalid alignment. So we'll get this warning on production systems
as well as debug/test systems. I think that's the important case to
catch, because misalignment will result in silent data corruption...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22 13:47       ` Brian Foster
@ 2019-08-22 23:03         ` Dave Chinner
  2019-08-23 12:33           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2019-08-22 23:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 22, 2019 at 09:47:16AM -0400, Brian Foster wrote:
> On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > Yes, because we really should wait for chained bios to complete
> > before returning. we can only do that by waiting on completion for
> > the entire chain, and the simplest way to do that is submit the
> > bio...
> > 
> 
> Ok. I guess that makes sense here because these are sync I/Os and this
> appears to be the only way to wait on a partially constructed chain
> (that I can tell).

Same here. I can't find any code that manually waits for a partially
submitted chain to complete...

> > > If we're already seeing
> > > weird behavior for bios with unaligned data memory, this seems like a
> > > recipe for similar weirdness. We'd also end up doing a partial write in
> > > scenarios where we already know we're returning an error.
> > 
> > THe partial write will happen anyway on a chained bio, something we
> > know already happens from the bug in bio chaining that was found in
> > this code earlier in the cycle.
> > 
> 
> Sure, a failure in a bio chain is essentially a partial write similar to
> if one such bio in the chain had failed. What about the non-chaining
> case? What happens on submission of an empty read/write bio? Does
> behavior depend on the underlying storage stack?

Seems to work just fine when I tested the code without the aligned
allocation patch. I do note that the block layer does allow zero
length (non-data) bios for things like flush requests.
Unsurprisingly, there are no checks to disallow completely empty bios
from being submitted, so I'm assuming that it's handled just fine
given the code worked when tested...

> Note that I see this patch is likely going away. I'm just saying I think
> we should have some confirmation on how the block layer behaves in these
> situations before we take an approach that relies on it. Is an empty bio
> essentially a no-op that serves as a serialization mechanism? Does the
> block layer return an error? Etc.

submit_bio_wait() didn't return an error, and even if it did it gets
overridden by the error from misalignment.

> > > Perhaps we
> > > should create an error path or use a check similar to what is already in
> > > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > > when we already know we're going to return an error) to call bio_endio()
> > > to undo any chaining.

bio_endio() doesn't wait for chaining. If there are children chained,
it just returns immediately and that leads to the use-after-free
situation I described to Christoph...

bio chaining just seems incredibly fragile in the face of errors
during chain building...
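
i.e. once the loop in xfs_rw_bdev() has done this even once (sketch
of the existing pattern there):

	struct bio *prev = bio;

	bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
	bio_copy_dev(bio, prev);
	bio->bi_iter.bi_sector = bio_end_sector(prev);
	bio_chain(prev, bio);
	submit_bio(prev);	/* prev is now in flight */

the only way to wait for the in-flight parts of the chain is to
submit the final bio and submit_bio_wait() on it, even when we
already know we are going to return an error.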

> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > > >  		if (nbytes > size)
> > > >  			nbytes = size;
> > > >  
> > > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > >  				      offset);
> > > 
> > > Similar concern here. The higher level code seems written under the
> > > assumption that bio_add_page() returns success or zero. In this case the
> > > error is basically tossed so we can also attempt to split/chain an empty
> > > bio, or even better, submit a partial write and continue operating as if
> > > nothing happened (outside of the warning). The latter case should be
> > > handled as a log I/O error one way or another.
> > 
> > See above - it does result in a failure when offset/nbytes is not
> > aligned to the underlying device...
> > 
> 
> Oops, this comment was misplaced. I meant to write this in response to
> the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
> the note about handling log I/O errors above).
> 
> The former has no such error checks that I can see. AFAICT, if we
> construct a partial or empty bio here, we'd warn and return.
> xlog_map_iclog_data() is a void function so the caller has no indication
> anything went wrong and submits the bio. If an empty (?) or partial bio
> doesn't return an error on bio completion, we've completely failed to
> account for the fact that we've written a partial iclog to disk.

This is more a "make sure we don't get stuck in an endless loop"
situation. If we've got unaligned iclogs, then a large, multi-page
allocation (32-256k) has failed to be page aligned. This is not
memory that comes from the slab, so alignment problems here indicate
a major problem with the page allocation and/or vmalloc code. It
just should never happen and if it does, then log writes failing are
going to be the least of our worries.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22 10:20           ` Ming Lei
@ 2019-08-23  0:14             ` Christoph Hellwig
  2019-08-23  1:19               ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2019-08-23  0:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Dave Chinner, Ming Lei,
	open list:XFS FILESYSTEM, Jens Axboe, linux-block

On Thu, Aug 22, 2019 at 06:20:00PM +0800, Ming Lei wrote:
> In theory, an fs bio shouldn't care about any DMA limits; those should be
> handled on the split bios used for doing IO to the device.
> 
> Also .dma_alignment isn't considered in blk_stack_limits(), so in case
> of DM, MD or other stacking drivers, fs code won't know the accurate
> .dma_alignment of underlying queues at all, and the stacking driver's
> queue dma alignment is still 512.

Trying to handle alignment lower down means bounce buffering, so I
don't think trying to handle it there is a sane idea.  I'd be much happier to
say non-passthrough bios need 512 byte alignment, period.  That should
cover all the sane cases and we can easily check for it.  The occasional
device that would need larger alignment just needs to deal with it.

> Also, supposing the check is added, I am a bit curious how the fs code
> handles the failure, so could you explain a bit about the failure handling?

Even just an assert is a start.  But a bio_add_page variant with
saner return value semantics would be helpful, and I have some ideas
there that I need to try out first.
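
Roughly something like this, though the interface is hypothetical
until I've actually tried it out (sketch):

	/* hypothetical: 0 on success, negative errno on failure */
	static int bio_add_page_checked(struct bio *bio, struct page *page,
			unsigned int len, unsigned int off)
	{
		/* the page is page aligned, so checking off covers the address */
		if (!blk_rq_aligned(bio->bi_disk->queue, off, len))
			return -EINVAL;
		if (bio_add_page(bio, page, len, off) != len)
			return -E2BIG;	/* bio full, caller must chain */
		return 0;
	}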

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-23  0:14             ` Christoph Hellwig
@ 2019-08-23  1:19               ` Ming Lei
  0 siblings, 0 replies; 44+ messages in thread
From: Ming Lei @ 2019-08-23  1:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ming Lei, open list:XFS FILESYSTEM, Jens Axboe,
	linux-block

On Thu, Aug 22, 2019 at 05:14:40PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2019 at 06:20:00PM +0800, Ming Lei wrote:
> > In theory, an fs bio shouldn't care about any DMA limits; those should be
> > handled on the split bios used for doing IO to the device.
> > 
> > Also .dma_alignment isn't considered in blk_stack_limits(), so in case
> > of DM, MD or other stacking drivers, fs code won't know the accurate
> > .dma_alignment of underlying queues at all, and the stacking driver's
> > queue dma alignment is still 512.
> 
> Trying to handle alignment lower down means bounce buffering, so I
> don't think trying to handle it there is a sane idea.  I'd be much happier to
> say non-passthrough bios need 512 byte alignment, period.  That should
> cover all the sane cases and we can easily check for it.  The occasional
> device that would need larger alignment just needs to deal with it.

Yeah, I agree we need to avoid bounce buffering, and it is fine to simply
check for 512 byte alignment.

Also, we should consider the interface/protocol between the fs and block
layers; it would make both sides happy to always align offset & length to
the logical block size. And that is reasonable for an fs bio.
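
Something like the following at bio build time would express that
contract (sketch):

	unsigned int lbs = queue_logical_block_size(q);

	/* fs bios: buffer offset and length must be LBS aligned */
	if ((offset | len) & (lbs - 1))
		return -EINVAL;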


Thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 22:39         ` Dave Chinner
@ 2019-08-23 12:10           ` Brian Foster
  0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2019-08-23 12:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Aug 23, 2019 at 08:39:59AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 09:40:17AM -0400, Brian Foster wrote:
> > On Thu, Aug 22, 2019 at 07:14:52AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 21, 2019 at 09:35:33AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 21, 2019 at 06:38:19PM +1000, Dave Chinner wrote:
> > > > kmem_alloc_io() interface) to skip further kmem_alloc() calls from this
> > > > path once we see one unaligned allocation. That assumes this behavior is
> > > > tied to functionality that isn't dynamically configured at runtime, of
> > > > course.
> > > 
> > > vmalloc() has a _lot_ more overhead than kmalloc (especially when
> > > vmalloc has to do multiple allocations itself to set up page table
> > > entries) so there is still an overall gain to be had by trying
> > > kmalloc even if 50% of allocations are unaligned.
> > > 
> > 
> > I had the impression that this unaligned allocation behavior is tied to
> > enablement of debug options that otherwise aren't enabled/disabled
> > dynamically. Hence, the unaligned allocation behavior is persistent for
> > a particular mount and repeated attempts are pointless once we see at
> > least one such result. Is that not the case?
> 
> The alignment for a given slab is likely to be persistent, but it
> will be different for different sized slabs. e.g. I saw 128 byte
> offsets for 512 byte slabs, and 1024 byte offsets for 4k slabs. The
> 1024 byte offsets worked just fine (because they are multiples of
> 512 bytes!) but the 128 byte ones didn't.
> 

Interesting. I wasn't aware offsets could be large enough to maintain
alignment requirements.

> Hence it's not a black and white "everything is unaligned and
> unsupportable" situation, nor is the alignment necessarily an issue
> for the underlying driver. e.g. most scsi and nvme hardware handles
> 8 byte alignment of buffers, and if the buffer is not aligned they
> bounce it (detected via the same blk_rq_aligned() check I added)
> and can still do the IO anyway. So a large amount of the IO stack
> just doesn't care about user buffers being unaligned....
> 
> > Again, I don't think performance is a huge deal so long as testing shows
> > that an fs is still usable with XFS running this kind of allocation
> > pattern.
> 
> It's added 10 minutes to the runtime of a full auto run with KASAN
> enabled on pmem. To put that in context, the entire run took:
> 
> real    461m36.831s
> user    44m31.779s
> sys     708m37.467s
> 
> More than 7.5 hours to complete, so ten minutes here or there is
> noise.
> 

Agreed. Thanks for the data.

> > In thinking further about it, aren't we essentially bypassing
> > these tools for associated allocations if they don't offer similar
> > functionality for vmalloc allocations?
> 
> kasan uses guard pages around vmalloc allocations to detect
> out-of-bounds accesses. It still tracks the page allocations, etc, so we
> still get use after free tracking, etc. i.e. AFAICT we don't
> actually lose any debugging functionality by using vmalloc here.
> 

Ok.

> > It might be worth 1.) noting that
> > as a consequence of this change in the commit log and 2.) having a
> > oneshot warning somewhere when we initially hit this problem so somebody
> > actually using one of these tools realizes that enabling it actually
> > changes allocation behavior. For example:
> > 
> > XFS ...: WARNING: Unaligned I/O memory allocation. VM debug enabled?
> > Disabling slab allocations for I/O.
> > 
> > ... or alternatively just add a WARN_ON_ONCE() or something with a
> > similar comment in the code.
> 
> Well, the WARN_ON_ONCE is in xfs_bio_add_page() when it detects an
> invalid alignment. So we'll get this warning on production systems
> as well as debug/test systems. I think that's the important case to
> catch, because misalignment will result in silent data corruption...
> 

That's in the subsequent patch that I thought was being dropped (or more
accurately, deferred until Christoph has a chance to try and get it
fixed in the block layer..)..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] xfs: alignment check bio buffers
  2019-08-22 23:03         ` Dave Chinner
@ 2019-08-23 12:33           ` Brian Foster
  0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2019-08-23 12:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Aug 23, 2019 at 09:03:57AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 09:47:16AM -0400, Brian Foster wrote:
> > On Thu, Aug 22, 2019 at 07:39:30AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 21, 2019 at 09:39:04AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 21, 2019 at 06:38:20PM +1000, Dave Chinner wrote:
> > > Yes, because we really should wait for chained bios to complete
> > > before returning. We can only do that by waiting on completion for
> > > the entire chain, and the simplest way to do that is submit the
> > > bio...
> > > 
> > 
> > Ok. I guess that makes sense here because these are sync I/Os and this
> > appears to be the only way to wait on a partially constructed chain
> > (that I can tell).
> 
> Same here. I can't find any code that manually waits for a partially
> submitted chain to complete...
> 
> > > > If we're already seeing
> > > > weird behavior for bios with unaligned data memory, this seems like a
> > > > recipe for similar weirdness. We'd also end up doing a partial write in
> > > > scenarios where we already know we're returning an error.
> > > 
> > > The partial write will happen anyway on a chained bio, something we
> > > know already happens from the bug in bio chaining that was found in
> > > this code earlier in the cycle.
> > > 
> > 
> > Sure, a failure in a bio chain is essentially a partial write similar to
> > if one such bio in the chain had failed. What about the non-chaining
> > case? What happens on submission of an empty read/write bio? Does
> > behavior depend on the underlying storage stack?
> 
> Seems to work just fine when I tested the code without the aligned
> allocation patch. I do note that the block layer does allow zero
> length (non-data) bios for things like flush requests.
> Unsurprisingly, there are no checks to disallow completely empty bios
> from being submitted, so I'm assuming that it's handled just fine
> given the code worked when tested...
> 

Yeah, I'm aware of flush requests and such. That's why I was asking
about empty read/write bios, since an empty bio by itself is not some
blatant error.

What exactly does "it works" mean here, though? Does an empty read/write
bio return with an error or just behave as a serialization no-op? It
might not matter for this particular sync I/O case, but it could make a
difference for others.

> > Note that I see this patch is likely going away. I'm just saying I think
> > we should have some confirmation on how the block layer behaves in these
> > situations before we take an approach that relies on it. Is an empty bio
> > essentially a no-op that serves as a serialization mechanism? Does the
> > block layer return an error? Etc.
> 
> submit_bio_wait() didn't return an error, and even if it did it gets
> overridden by the error from misalignment.
> 

I wrote "block layer," not submit_bio_wait() specifically. I'm asking
whether the bio might complete with an error or not.

> > > > Perhaps we
> > > > should create an error path or use a check similar to what is already in
> > > > xfs_buf_ioapply_map() (though I'm not a fan of submitting a partial I/O
> > > > when we already know we're going to return an error) to call bio_endio()
> > > > to undo any chaining.
> 
> bio_endio() doesn't wait for chaining. If there are children chained,
> it just returns immediately and that leads to the use-after-free
> situation I described to Christoph...
> 
> bio chaining just seems incredibly fragile in the face of errors
> during chain building...
> 

(The above is from two mails ago. Not sure why you're replying here..?)
Do note that bio_remaining_done() has special case chaining logic,
though. It's not just a no-op as implied. It looks to me that it behaves
as if the I/O were submitted and completed (async), but I could easily
be missing something as I'm not familiar with the code..

> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -1294,7 +1294,7 @@ xfs_buf_ioapply_map(
> > > > >  		if (nbytes > size)
> > > > >  			nbytes = size;
> > > > >  
> > > > > -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > > +		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
> > > > >  				      offset);
> > > > 
> > > > Similar concern here. The higher level code seems written under the
> > > > assumption that bio_add_page() returns success or zero. In this case the
> > > > error is basically tossed so we can also attempt to split/chain an empty
> > > > bio, or even better, submit a partial write and continue operating as if
> > > > nothing happened (outside of the warning). The latter case should be
> > > > handled as a log I/O error one way or another.
> > > 
> > > See above - it does result in a failure when offset/nbytes is not
> > > aligned to the underlying device...
> > > 
> > 
> > Oops, this comment was misplaced. I meant to write this in response to
> > the changes to xlog_map_iclog_data(), not xfs_buf_ioapply_map() (hence
> > the note about handling log I/O errors above).
> > 
> > The former has no such error checks that I can see. AFAICT, if we
> > construct a partial or empty bio here, we'd warn and return.
> > xlog_map_iclog_data() is a void function so the caller has no indication
> > anything went wrong and submits the bio. If an empty (?) or partial bio
> > doesn't return an error on bio completion, we've completely failed to
> > account for the fact that we've written a partial iclog to disk.
> 
> This is more a "make sure we don't get stuck in an endless loop"
> situation. If we've got unaligned iclogs, then a large, multi-page
> allocation (32-256k) has failed to be page aligned. This is not
> memory that comes from the slab, so alignment problems here indicate
> a major problem with the page allocation and/or vmalloc code. It
> just should never happen and if it does, then log writes failing are
> going to be the least of our worries.
> 

There is no infinite loop vector here. That aside, the iclog data buffer
is not a multi-page allocation on all systems. And FWIW, I don't think
all multi-page allocations are non-slab (slab/slub/slob appear to have
different levels of multi-page support, though that probably doesn't
matter on 4k systems for our supported iclog sizes). Otherwise, we're
just making assumptions on current and future failure semantics of an
external subsystem that's out of our control, and doing so to justify
lazy error handling logic that can be trivially fixed. All that's needed
here is to ensure that the new error path eventually translates into a
shutdown like all other log I/O related errors.
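
E.g. something like this in the caller would do (sketch, assuming
xlog_map_iclog_data() grows an int return value):

	error = xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count);
	if (error) {
		/* partial/empty bio: treat it as a log IO error */
		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
		return;
	}
	submit_bio(&iclog->ic_bio);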

My understanding is this patch is being dropped for now so this feedback
is just for reference should this logic re-emerge when we have a
bio_add_page() variant with new error semantics..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] xfs: add kmem_alloc_io()
  2019-08-22 14:26                       ` Vlastimil Babka
@ 2019-08-26 12:21                         ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2019-08-26 12:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Dave Chinner, Peter Zijlstra, Christoph Hellwig, linux-xfs,
	Ingo Molnar, Will Deacon, linux-kernel, linux-mm, penguin-kernel

On Thu 22-08-19 16:26:42, Vlastimil Babka wrote:
> On 8/22/19 3:17 PM, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 02:19:04PM +0200, Vlastimil Babka wrote:
> >> On 8/22/19 2:07 PM, Dave Chinner wrote:
> >> > On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> >> > 
> >> > No, the problem is this (using kmalloc as a general term for
> >> > allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)
> >> > 
> >> >    some random kernel code
> >> >     kmalloc(GFP_KERNEL)
> >> >      reclaim
> >> >      PF_MEMALLOC
> >> >      shrink_slab
> >> >       xfs_inode_shrink
> >> >        XFS_ILOCK
> >> >         xfs_buf_allocate_memory()
> >> >          kmalloc(GFP_KERNEL)
> >> > 
> >> > And so locks on inodes in reclaim are seen below reclaim. Then
> >> > somewhere else we have:
> >> > 
> >> >    some high level read-only xfs code like readdir
> >> >     XFS_ILOCK
> >> >      xfs_buf_allocate_memory()
> >> >       kmalloc(GFP_KERNEL)
> >> >        reclaim
> >> > 
> >> > And this one throws false positive lockdep warnings because we
> >> > called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
> >> 
> >> OK, and what exactly makes this positive a false one? Why can't it continue like
> >> the first example where reclaim leads to another XFS_ILOCK, thus deadlock?
> > 
> > Because above reclaim we only have operations being done on
> > referenced inodes, and below reclaim we only have unreferenced
> > inodes. We never lock the same inode both above and below reclaim
> > at the same time.
> > 
> > IOWs, an operation above reclaim cannot see, access or lock
> > unreferenced inodes, except in inode write clustering, and that uses
> > trylocks so cannot deadlock with reclaim.
> > 
> > An operation below reclaim cannot see, access or lock referenced
> > inodes except during inode write clustering, and that uses trylocks
> > so cannot deadlock with code above reclaim.
> 
> Thanks for elaborating. Perhaps lockdep experts (not me) would know how to
> express that. If not possible, then replacing GFP_NOFS with __GFP_NOLOCKDEP
> should indeed suppress the warning, while allowing FS reclaim.

This was certainly my hope when introducing __GFP_NOLOCKDEP.
I couldn't have done the second step because that requires a deep
understanding of the code in question which is beyond my capacity. It
seems we still haven't found a brave soul to start converting GFP_NOFS
to __GFP_NOLOCKDEP. And it would be really appreciated.
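
The mechanical part for XFS would mostly be confined to
kmem_flags_convert(); the hard part is auditing each KM_NOFS call
site to confirm it only exists to silence lockdep, e.g. (sketch):

	lflags = GFP_KERNEL | __GFP_NOWARN;
	if (flags & KM_NOFS)
		lflags |= __GFP_NOLOCKDEP;	/* was: lflags &= ~__GFP_FS */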

Thanks.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2019-08-26 12:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-21  8:38 [PATCH 0/3] xfs: avoid IO issues unaligned memory allocation Dave Chinner
2019-08-21  8:38 ` [PATCH 1/3] xfs: add kmem allocation trace points Dave Chinner
2019-08-21 13:34   ` Brian Foster
2019-08-21 23:20   ` Christoph Hellwig
2019-08-21  8:38 ` [PATCH 2/3] xfs: add kmem_alloc_io() Dave Chinner
2019-08-21 13:35   ` Brian Foster
2019-08-21 15:08     ` Darrick J. Wong
2019-08-21 21:24       ` Dave Chinner
2019-08-21 15:23     ` Eric Sandeen
2019-08-21 21:14     ` Dave Chinner
2019-08-22 13:40       ` Brian Foster
2019-08-22 22:39         ` Dave Chinner
2019-08-23 12:10           ` Brian Foster
2019-08-21 23:24   ` Christoph Hellwig
2019-08-22  0:31     ` Dave Chinner
2019-08-22  7:59       ` Christoph Hellwig
2019-08-22  8:51         ` Peter Zijlstra
2019-08-22  9:10           ` Peter Zijlstra
2019-08-22 10:14             ` Dave Chinner
2019-08-22 11:14               ` Vlastimil Babka
2019-08-22 12:07                 ` Dave Chinner
2019-08-22 12:19                   ` Vlastimil Babka
2019-08-22 13:17                     ` Dave Chinner
2019-08-22 14:26                       ` Vlastimil Babka
2019-08-26 12:21                         ` Michal Hocko
2019-08-21  8:38 ` [PATCH 3/3] xfs: alignment check bio buffers Dave Chinner
2019-08-21 13:39   ` Brian Foster
2019-08-21 21:39     ` Dave Chinner
2019-08-22 13:47       ` Brian Foster
2019-08-22 23:03         ` Dave Chinner
2019-08-23 12:33           ` Brian Foster
2019-08-21 23:30     ` Christoph Hellwig
2019-08-22  0:44       ` Dave Chinner
2019-08-21 23:29   ` Christoph Hellwig
2019-08-22  0:37     ` Dave Chinner
2019-08-22  8:03       ` Christoph Hellwig
2019-08-22 10:17         ` Dave Chinner
2019-08-22  2:50     ` Ming Lei
2019-08-22  4:49       ` Dave Chinner
2019-08-22  7:23         ` Ming Lei
2019-08-22  8:08         ` Christoph Hellwig
2019-08-22 10:20           ` Ming Lei
2019-08-23  0:14             ` Christoph Hellwig
2019-08-23  1:19               ` Ming Lei
