linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] block layer filter and block device snapshot module
@ 2020-10-21  9:04 Sergei Shtepa
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21  9:04 UTC (permalink / raw)
  To: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm, sergei.shtepa

Hello everyone! Requesting for your comments and suggestions.

# blk-filter

Block layer filter allows to intercept BIO requests to a block device.

Interception is performed at the very beginning of the BIO request
processing, and therefore does not affect the operation of the request
processing queue. This also makes it possible to intercept requests from
a specific block device, rather than from the entire disk.

The logic of the submit_bio function has been changed - since the
function execution results are not processed anywhere (except for swap
and direct-io) the function won't return a value anymore.

Now the submit_bio_direct() function is called whenever the result of
the blk_qc_t function is required. submit_bio_direct() is not
intercepted by the block layer filter. This is logical for swap and
direct-io.

Block layer filter allows you to enable and disable the filter driver on
the fly. When a new block device is added, the filter driver can start
filtering this device. When you delete a device, the filter can remove
its own filter.

The idea of multiple altitudes had to be abandoned in order to simplify
implementation and make it more reliable. Different filter drivers can
work simultaneously, but each on its own block device.

# blk-snap

We propose a new kernel module - blk-snap. This module implements
snapshot and changed block tracking functionality. It is intended to
create backup copies of any block devices without usage of device mapper.
Snapshots are temporary and are destroyed after the backup process has
finished. Changed block tracking allows for incremental and differential
backup copies.

blk-snap uses block layer filter. Block layer filter provides a callback
to intercept bio-requests. If a block device disappears for whatever
reason, send a synchronous request to remove the device from filtering.

blk-snap kernel module is a product of a deep refactoring of the
out-of-tree kernel veeamsnap kernel module
(https://github.com/veeam/veeamsnap/):
* all conditional compilation branches that served for the purpose of
  compatibility with older kernels have been removed;
* linux kernel code style has been applied;
* blk-snap mostly takes advantage of the existing kernel code instead of
  reinventing the wheel;
* all redundant code (such as persistent cbt and snapstore collector)
  has been removed.

Several important things are still have to be done:
* refactor the module interface for interaction with a user-space code -
  it is already clear that the implementation of some calls can be
  improved.

Your feedback would be greatly appreciated!

Sergei Shtepa (2):
  Block layer filter - second version
  blk-snap - snapshots and change-tracking for block devices

 block/Kconfig                               |  11 +
 block/Makefile                              |   1 +
 block/blk-core.c                            |  52 +-
 block/blk-filter-internal.h                 |  29 +
 block/blk-filter.c                          | 286 ++++++
 block/partitions/core.c                     |  14 +-
 drivers/block/Kconfig                       |   2 +
 drivers/block/Makefile                      |   1 +
 drivers/block/blk-snap/Kconfig              |  24 +
 drivers/block/blk-snap/Makefile             |  28 +
 drivers/block/blk-snap/big_buffer.c         | 193 ++++
 drivers/block/blk-snap/big_buffer.h         |  24 +
 drivers/block/blk-snap/blk-snap-ctl.h       | 190 ++++
 drivers/block/blk-snap/blk_deferred.c       | 566 +++++++++++
 drivers/block/blk-snap/blk_deferred.h       |  67 ++
 drivers/block/blk-snap/blk_descr_file.c     |  82 ++
 drivers/block/blk-snap/blk_descr_file.h     |  26 +
 drivers/block/blk-snap/blk_descr_mem.c      |  66 ++
 drivers/block/blk-snap/blk_descr_mem.h      |  14 +
 drivers/block/blk-snap/blk_descr_multidev.c |  86 ++
 drivers/block/blk-snap/blk_descr_multidev.h |  25 +
 drivers/block/blk-snap/blk_descr_pool.c     | 190 ++++
 drivers/block/blk-snap/blk_descr_pool.h     |  38 +
 drivers/block/blk-snap/blk_redirect.c       | 507 ++++++++++
 drivers/block/blk-snap/blk_redirect.h       |  73 ++
 drivers/block/blk-snap/blk_util.c           |  33 +
 drivers/block/blk-snap/blk_util.h           |  33 +
 drivers/block/blk-snap/cbt_map.c            | 210 +++++
 drivers/block/blk-snap/cbt_map.h            |  62 ++
 drivers/block/blk-snap/common.h             |  31 +
 drivers/block/blk-snap/ctrl_fops.c          | 691 ++++++++++++++
 drivers/block/blk-snap/ctrl_fops.h          |  19 +
 drivers/block/blk-snap/ctrl_pipe.c          | 562 +++++++++++
 drivers/block/blk-snap/ctrl_pipe.h          |  34 +
 drivers/block/blk-snap/ctrl_sysfs.c         |  73 ++
 drivers/block/blk-snap/ctrl_sysfs.h         |   5 +
 drivers/block/blk-snap/defer_io.c           | 397 ++++++++
 drivers/block/blk-snap/defer_io.h           |  39 +
 drivers/block/blk-snap/main.c               |  82 ++
 drivers/block/blk-snap/params.c             |  58 ++
 drivers/block/blk-snap/params.h             |  29 +
 drivers/block/blk-snap/rangevector.c        |  85 ++
 drivers/block/blk-snap/rangevector.h        |  31 +
 drivers/block/blk-snap/snapimage.c          | 982 ++++++++++++++++++++
 drivers/block/blk-snap/snapimage.h          |  16 +
 drivers/block/blk-snap/snapshot.c           | 225 +++++
 drivers/block/blk-snap/snapshot.h           |  17 +
 drivers/block/blk-snap/snapstore.c          | 929 ++++++++++++++++++
 drivers/block/blk-snap/snapstore.h          |  68 ++
 drivers/block/blk-snap/snapstore_device.c   | 532 +++++++++++
 drivers/block/blk-snap/snapstore_device.h   |  63 ++
 drivers/block/blk-snap/snapstore_file.c     |  52 ++
 drivers/block/blk-snap/snapstore_file.h     |  15 +
 drivers/block/blk-snap/snapstore_mem.c      |  91 ++
 drivers/block/blk-snap/snapstore_mem.h      |  20 +
 drivers/block/blk-snap/snapstore_multidev.c | 118 +++
 drivers/block/blk-snap/snapstore_multidev.h |  22 +
 drivers/block/blk-snap/tracker.c            | 449 +++++++++
 drivers/block/blk-snap/tracker.h            |  38 +
 drivers/block/blk-snap/tracking.c           | 270 ++++++
 drivers/block/blk-snap/tracking.h           |  13 +
 drivers/block/blk-snap/version.h            |   7 +
 fs/block_dev.c                              |   6 +-
 fs/direct-io.c                              |   2 +-
 fs/iomap/direct-io.c                        |   2 +-
 include/linux/bio.h                         |   4 +-
 include/linux/blk-filter.h                  |  76 ++
 include/linux/genhd.h                       |   8 +-
 kernel/power/swap.c                         |   2 +-
 mm/page_io.c                                |   4 +-
 70 files changed, 9074 insertions(+), 26 deletions(-)
 create mode 100644 block/blk-filter-internal.h
 create mode 100644 block/blk-filter.c
 create mode 100644 drivers/block/blk-snap/Kconfig
 create mode 100644 drivers/block/blk-snap/Makefile
 create mode 100644 drivers/block/blk-snap/big_buffer.c
 create mode 100644 drivers/block/blk-snap/big_buffer.h
 create mode 100644 drivers/block/blk-snap/blk-snap-ctl.h
 create mode 100644 drivers/block/blk-snap/blk_deferred.c
 create mode 100644 drivers/block/blk-snap/blk_deferred.h
 create mode 100644 drivers/block/blk-snap/blk_descr_file.c
 create mode 100644 drivers/block/blk-snap/blk_descr_file.h
 create mode 100644 drivers/block/blk-snap/blk_descr_mem.c
 create mode 100644 drivers/block/blk-snap/blk_descr_mem.h
 create mode 100644 drivers/block/blk-snap/blk_descr_multidev.c
 create mode 100644 drivers/block/blk-snap/blk_descr_multidev.h
 create mode 100644 drivers/block/blk-snap/blk_descr_pool.c
 create mode 100644 drivers/block/blk-snap/blk_descr_pool.h
 create mode 100644 drivers/block/blk-snap/blk_redirect.c
 create mode 100644 drivers/block/blk-snap/blk_redirect.h
 create mode 100644 drivers/block/blk-snap/blk_util.c
 create mode 100644 drivers/block/blk-snap/blk_util.h
 create mode 100644 drivers/block/blk-snap/cbt_map.c
 create mode 100644 drivers/block/blk-snap/cbt_map.h
 create mode 100644 drivers/block/blk-snap/common.h
 create mode 100644 drivers/block/blk-snap/ctrl_fops.c
 create mode 100644 drivers/block/blk-snap/ctrl_fops.h
 create mode 100644 drivers/block/blk-snap/ctrl_pipe.c
 create mode 100644 drivers/block/blk-snap/ctrl_pipe.h
 create mode 100644 drivers/block/blk-snap/ctrl_sysfs.c
 create mode 100644 drivers/block/blk-snap/ctrl_sysfs.h
 create mode 100644 drivers/block/blk-snap/defer_io.c
 create mode 100644 drivers/block/blk-snap/defer_io.h
 create mode 100644 drivers/block/blk-snap/main.c
 create mode 100644 drivers/block/blk-snap/params.c
 create mode 100644 drivers/block/blk-snap/params.h
 create mode 100644 drivers/block/blk-snap/rangevector.c
 create mode 100644 drivers/block/blk-snap/rangevector.h
 create mode 100644 drivers/block/blk-snap/snapimage.c
 create mode 100644 drivers/block/blk-snap/snapimage.h
 create mode 100644 drivers/block/blk-snap/snapshot.c
 create mode 100644 drivers/block/blk-snap/snapshot.h
 create mode 100644 drivers/block/blk-snap/snapstore.c
 create mode 100644 drivers/block/blk-snap/snapstore.h
 create mode 100644 drivers/block/blk-snap/snapstore_device.c
 create mode 100644 drivers/block/blk-snap/snapstore_device.h
 create mode 100644 drivers/block/blk-snap/snapstore_file.c
 create mode 100644 drivers/block/blk-snap/snapstore_file.h
 create mode 100644 drivers/block/blk-snap/snapstore_mem.c
 create mode 100644 drivers/block/blk-snap/snapstore_mem.h
 create mode 100644 drivers/block/blk-snap/snapstore_multidev.c
 create mode 100644 drivers/block/blk-snap/snapstore_multidev.h
 create mode 100644 drivers/block/blk-snap/tracker.c
 create mode 100644 drivers/block/blk-snap/tracker.h
 create mode 100644 drivers/block/blk-snap/tracking.c
 create mode 100644 drivers/block/blk-snap/tracking.h
 create mode 100644 drivers/block/blk-snap/version.h
 create mode 100644 include/linux/blk-filter.h

--
2.20.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:04 [PATCH 0/2] block layer filter and block device snapshot module Sergei Shtepa
@ 2020-10-21  9:04 ` Sergei Shtepa
  2020-10-21  9:14   ` Johannes Thumshirn
                     ` (3 more replies)
       [not found] ` <1603271049-20681-3-git-send-email-sergei.shtepa@veeam.com>
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21  9:04 UTC (permalink / raw)
  To: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm, sergei.shtepa

Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
 block/Kconfig               |  11 ++
 block/Makefile              |   1 +
 block/blk-core.c            |  52 +++++--
 block/blk-filter-internal.h |  29 ++++
 block/blk-filter.c          | 286 ++++++++++++++++++++++++++++++++++++
 block/partitions/core.c     |  14 +-
 fs/block_dev.c              |   6 +-
 fs/direct-io.c              |   2 +-
 fs/iomap/direct-io.c        |   2 +-
 include/linux/bio.h         |   4 +-
 include/linux/blk-filter.h  |  76 ++++++++++
 include/linux/genhd.h       |   8 +-
 kernel/power/swap.c         |   2 +-
 mm/page_io.c                |   4 +-
 14 files changed, 471 insertions(+), 26 deletions(-)
 create mode 100644 block/blk-filter-internal.h
 create mode 100644 block/blk-filter.c
 create mode 100644 include/linux/blk-filter.h

diff --git a/block/Kconfig b/block/Kconfig
index bbad5e8bbffe..a308801b4376 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -204,6 +204,17 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
 	  by falling back to the kernel crypto API when inline
 	  encryption hardware is not present.
 
+config BLK_FILTER
+	bool "Enable support for block layer filters"
+	default y
+	depends on MODULES
+	help
+	  Enabling this lets third-party kernel modules intercept
+	  bio requests for any block device. This allows them to implement
+	  changed block tracking and snapshots without any reconfiguration of
+	  the existing setup. For example, this option allows snapshotting of
+	  a block device without adding it to LVM.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..b8ee50b8e031 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -38,3 +38,4 @@ obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
 obj-$(CONFIG_BLK_PM)		+= blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= keyslot-manager.o blk-crypto.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK)	+= blk-crypto-fallback.o
+obj-$(CONFIG_BLK_FILTER)	+= blk-filter.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 10c08ac50697..cc06402af695 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1216,23 +1216,20 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 EXPORT_SYMBOL(submit_bio_noacct);
 
 /**
- * submit_bio - submit a bio to the block device layer for I/O
- * @bio: The &struct bio which describes the I/O
- *
- * submit_bio() is used to submit I/O requests to block devices.  It is passed a
- * fully set up &struct bio that describes the I/O that needs to be done.  The
- * bio will be send to the device described by the bi_disk and bi_partno fields.
+ * submit_bio_direct - submit a bio to the block device layer for I/O
+ * bypass filter.
+ * @bio:  The bio describing the location in memory and on the device.
  *
- * The success/failure status of the request, along with notification of
- * completion, is delivered asynchronously through the ->bi_end_io() callback
- * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
- * been called.
+ * Description:
+ *    This is a version of submit_bio() that shall only be used for I/O
+ *    that cannot be intercepted by block layer filters.
+ *    All file systems and other upper level users of the block layer
+ *    should use submit_bio() instead.
+ *    Use this function to access the swap partition and directly access
+ *    the block device file.
  */
-blk_qc_t submit_bio(struct bio *bio)
+blk_qc_t submit_bio_direct(struct bio *bio)
 {
-	if (blkcg_punt_bio_submit(bio))
-		return BLK_QC_T_NONE;
-
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
@@ -1282,8 +1279,35 @@ blk_qc_t submit_bio(struct bio *bio)
 
 	return submit_bio_noacct(bio);
 }
+EXPORT_SYMBOL(submit_bio_direct);
+
+/**
+ * submit_bio - submit a bio to the block device layer for I/O
+ * @bio: The &struct bio which describes the I/O
+ *
+ * submit_bio() is used to submit I/O requests to block devices.  It is passed a
+ * fully set up &struct bio that describes the I/O that needs to be done.  The
+ * bio will be send to the device described by the bi_disk and bi_partno fields.
+ *
+ * The success/failure status of the request, along with notification of
+ * completion, is delivered asynchronously through the ->bi_end_io() callback
+ * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
+ * been called.
+ */
+void submit_bio(struct bio *bio)
+{
+	if (blkcg_punt_bio_submit(bio))
+		return;
+
+#ifdef CONFIG_BLK_FILTER
+	blk_filter_submit_bio(bio);
+#else
+	submit_bio_direct(bio);
+#endif
+}
 EXPORT_SYMBOL(submit_bio);
 
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for the new queue limits
diff --git a/block/blk-filter-internal.h b/block/blk-filter-internal.h
new file mode 100644
index 000000000000..d456a09f50db
--- /dev/null
+++ b/block/blk-filter-internal.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ *
+ * Block device filters internal declarations
+ */
+
+#ifndef BLK_FILTER_INTERNAL_H
+#define BLK_FILTER_INTERNAL_H
+
+#ifdef CONFIG_BLK_FILTER
+#include <linux/blk-filter.h>
+
+void blk_filter_part_add(struct hd_struct *part, dev_t devt);
+
+void blk_filter_part_del(struct hd_struct *part);
+
+#else /* CONFIG_BLK_FILTER */
+
+
+static inline void blk_filter_part_add(struct hd_struct *part, dev_t devt)
+{ };
+
+static inline void blk_filter_part_del(struct hd_struct *part)
+{ };
+
+#endif /* CONFIG_BLK_FILTER */
+
+#endif
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..f6de16c45a16
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/genhd.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include "blk-filter-internal.h"
+#include <linux/rwsem.h>
+
+
+LIST_HEAD(filters);
+DECLARE_RWSEM(filters_lock);
+
+static void blk_filter_release(struct kref *kref)
+{
+	struct blk_filter *flt = container_of(kref, struct blk_filter, kref);
+
+	kfree(flt);
+}
+
+static inline void blk_filter_get(struct blk_filter *flt)
+{
+	kref_get(&flt->kref);
+}
+
+static inline void blk_filter_put(struct blk_filter *flt)
+{
+	kref_put(&flt->kref, blk_filter_release);
+}
+
+
+/**
+ * blk_filter_part_add() - Notify filters when a new partition is added.
+ * @part: The partition for new block device.
+ * @devt: Device id for new block device.
+ *
+ * Description:
+ *    When the block device is appears in the system, call the filter
+ *    callback to notify that the block device appears.
+ */
+void blk_filter_part_add(struct hd_struct *part, dev_t devt)
+{
+	down_read(&filters_lock);
+	if (!list_empty(&filters)) {
+		struct list_head *_list_head;
+
+		list_for_each(_list_head, &filters) {
+			void *filter_data;
+			bool attached = false;
+			struct blk_filter *flt;
+
+			flt = list_entry(_list_head, struct blk_filter, link);
+
+			attached = flt->ops->part_add(devt, &filter_data);
+			if (attached) {
+				blk_filter_get(flt);
+				part->filter = flt;
+				part->filter_data = filter_data;
+				break;
+			}
+		}
+	}
+	up_read(&filters_lock);
+}
+
+/**
+ * blk_filter_part_del() - Notify filters when the partition is deleted.
+ * @part: The partition of block device.
+ *
+ * Description:
+ *    When the block device is destroying and the partition is releasing,
+ *    call the filter callback to notify that the block device will be
+ *    deleted.
+ */
+void blk_filter_part_del(struct hd_struct *part)
+{
+	struct blk_filter *flt = part->filter;
+
+	if (!flt)
+		return;
+
+	flt->ops->part_del(part->filter_data);
+
+	part->filter_data = NULL;
+	part->filter = NULL;
+	blk_filter_put(flt);
+}
+
+
+/**
+ * blk_filter_submit_bio() - Send new bio to filters for processing.
+ * @bio: The new bio for block I/O layer.
+ *
+ * Description:
+ *    This function is an implementation of block layer filter
+ *    interception. If the filter is attached to this block device,
+ *    then bio will be redirected to the filter kernel module.
+ */
+void blk_filter_submit_bio(struct bio *bio)
+{
+	bool intercepted = false;
+	struct hd_struct *part;
+
+	bio_get(bio);
+
+	part = disk_get_part(bio->bi_disk, bio->bi_partno);
+	if (unlikely(!part)) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+
+		bio_put(bio);
+		return;
+	}
+
+	down_read(&part->filter_rw_lockup);
+
+	if (part->filter)
+		intercepted = part->filter->ops->filter_bio(bio, part->filter_data);
+
+	up_read(&part->filter_rw_lockup);
+
+	if (!intercepted)
+		submit_bio_direct(bio);
+
+	disk_put_part(part);
+
+	bio_put(bio);
+}
+EXPORT_SYMBOL(blk_filter_submit_bio);
+
+/**
+ * blk_filter_register() - Register block layer filter.
+ * @ops: New filter callbacks.
+ *
+ * Return:
+ *     Filter ID, a pointer to the service structure of the filter.
+ *
+ * Description:
+ *    Create new filter structure.
+ *    Use blk_filter_attach to attach devices to filter.
+ */
+void *blk_filter_register(struct blk_filter_ops *ops)
+{
+	struct blk_filter *flt;
+
+	flt = kzalloc(sizeof(struct blk_filter), GFP_KERNEL);
+	if (!flt)
+		return NULL;
+
+	kref_init(&flt->kref);
+	flt->ops = ops;
+
+	down_write(&filters_lock);
+	list_add_tail(&flt->link, &filters);
+	up_write(&filters_lock);
+
+	return flt;
+}
+EXPORT_SYMBOL(blk_filter_register);
+
+/**
+ * blk_filter_unregister() - Unregister block layer filter.
+ * @filter: filter identifier.
+ *
+ * Description:
+ *    Before call blk_filter_unregister() and unload filter module all
+ *    partitions MUST be detached. Otherwise, the system will have a
+ *    filter with non-existent interception functions.
+ */
+void blk_filter_unregister(void *filter)
+{
+	struct blk_filter *flt = filter;
+
+	down_write(&filters_lock);
+	list_del(&flt->link);
+	up_write(&filters_lock);
+
+	blk_filter_put(flt);
+}
+EXPORT_SYMBOL(blk_filter_unregister);
+
+/**
+ * blk_filter_attach() - Attach block layer filter.
+ * @devt: The block device identification number.
+ * @filter: Filter identifier.
+ * @filter_data: Specific filters data for this device.
+ *
+ * Return:
+ *    Return code.
+ *    -ENODEV - cannot find this device, it is OK if the device does not exist yet.
+ *    -EALREADY - this device is already attached to this filter.
+ *    -EBUSY - this device is already attached to the another filter.
+ *
+ * Description:
+ *    Attach the device to the block layer filter.
+ *    Only one filter can be attached to a single device.
+ */
+int blk_filter_attach(dev_t devt, void *filter, void *filter_data)
+{
+	int ret = 0;
+	struct blk_filter *flt = filter;
+	struct block_device *blk_dev;
+
+
+	blk_dev = bdget(devt);
+	if (!blk_dev)
+		return -ENODEV;
+
+	blk_filter_freeze(blk_dev);
+
+	if (blk_dev->bd_part->filter) {
+		if (blk_dev->bd_part->filter == flt)
+			ret = -EALREADY;
+		else
+			ret = -EBUSY;
+	} else {
+		blk_filter_get(flt);
+		blk_dev->bd_part->filter = flt;
+		blk_dev->bd_part->filter_data = filter_data;
+	}
+
+	blk_filter_thaw(blk_dev);
+
+	bdput(blk_dev);
+
+	return ret;
+}
+EXPORT_SYMBOL(blk_filter_attach);
+
+/**
+ * blk_filter_detach() - Detach block layer filter.
+ * @devt: The block device identification number.
+ *
+ * Description:
+ *    Detach the device from the block layer filter.
+ *    Do not forget detach all devices before calling the
+ *    blk_filter_unregister() function and unload the module!
+ */
+void blk_filter_detach(dev_t devt)
+{
+	struct blk_filter *flt;
+	struct block_device *blk_dev;
+
+	blk_dev = bdget(devt);
+	if (!blk_dev)
+		return;
+
+	blk_filter_freeze(blk_dev);
+
+	flt = blk_dev->bd_part->filter;
+	if (flt) {
+		blk_dev->bd_part->filter_data = NULL;
+		blk_dev->bd_part->filter = NULL;
+		blk_filter_put(flt);
+	}
+
+	blk_filter_thaw(blk_dev);
+
+	bdput(blk_dev);
+}
+EXPORT_SYMBOL(blk_filter_detach);
+
+/**
+ * blk_filter_freeze() - Lock bio submitting.
+ * @bdev: The block device pointer.
+ *
+ * Description:
+ *    Stop bio processing.
+ */
+void blk_filter_freeze(struct block_device *bdev)
+{
+	down_write(&bdev->bd_part->filter_rw_lockup);
+}
+EXPORT_SYMBOL(blk_filter_freeze);
+
+/**
+ * blk_filter_thaw() - Unlock bio submitting.
+ * @bdev: The block device pointer.
+ *
+ * Description:
+ *    Resume bio processing.
+ */
+void blk_filter_thaw(struct block_device *bdev)
+{
+	up_write(&bdev->bd_part->filter_rw_lockup);
+}
+EXPORT_SYMBOL(blk_filter_thaw);
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 722406b841df..6b845e98b9a1 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -11,6 +11,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/raid/detect.h>
 #include "check.h"
+#include "../blk-filter-internal.h"
 
 static int (*check_part[])(struct parsed_partitions *) = {
 	/*
@@ -320,9 +321,11 @@ int hd_ref_init(struct hd_struct *part)
  */
 void delete_partition(struct gendisk *disk, struct hd_struct *part)
 {
-	struct disk_part_tbl *ptbl =
-		rcu_dereference_protected(disk->part_tbl, 1);
+	struct disk_part_tbl *ptbl;
+
+	blk_filter_part_del(part);
 
+	ptbl = rcu_dereference_protected(disk->part_tbl, 1);
 	/*
 	 * ->part_tbl is referenced in this part's release handler, so
 	 *  we have to hold the disk device
@@ -412,6 +415,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	p->nr_sects = len;
 	p->partno = partno;
 	p->policy = get_disk_ro(disk);
+#ifdef CONFIG_BLK_FILTER
+	init_rwsem(&p->filter_rw_lockup);
+#endif
 
 	if (info) {
 		struct partition_meta_info *pinfo;
@@ -469,6 +475,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	/* everything is up and running, commence */
 	rcu_assign_pointer(ptbl->part[partno], p);
 
+	/*inform filter about a new partition*/
+	blk_filter_part_add(p, devt);
+
 	/* suppress uevent if the disk suppresses it */
 	if (!dev_get_uevent_suppress(ddev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);
@@ -552,6 +561,7 @@ int bdev_del_partition(struct block_device *bdev, int partno)
 		goto out_unlock;
 
 	sync_blockdev(bdevp);
+
 	invalidate_bdev(bdevp);
 
 	delete_partition(bdev->bd_disk, part);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8ae833e00443..431eae17fd8f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -237,7 +237,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(&bio, iocb);
 
-	qc = submit_bio(&bio);
+	qc = submit_bio_direct(&bio);
 	for (;;) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(bio.bi_private))
@@ -400,7 +400,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				polled = true;
 			}
 
-			qc = submit_bio(bio);
+			qc = submit_bio_direct(bio);
 
 			if (polled)
 				WRITE_ONCE(iocb->ki_cookie, qc);
@@ -421,7 +421,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			atomic_inc(&dio->ref);
 		}
 
-		submit_bio(bio);
+		submit_bio_direct(bio);
 		bio = bio_alloc(GFP_KERNEL, nr_pages);
 	}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 183299892465..d9bb1b6f6814 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -459,7 +459,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 		sdio->submit_io(bio, dio->inode, sdio->logical_offset_in_bio);
 		dio->bio_cookie = BLK_QC_T_NONE;
 	} else
-		dio->bio_cookie = submit_bio(bio);
+		dio->bio_cookie = submit_bio_direct(bio);
 
 	sdio->bio = NULL;
 	sdio->boundary = 0;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..e05f20ce8b5f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -73,7 +73,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 				file_inode(dio->iocb->ki_filp),
 				iomap, bio, pos);
 	else
-		dio->submit.cookie = submit_bio(bio);
+		dio->submit.cookie = submit_bio_direct(bio);
 }
 
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..5b0a32697207 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -10,6 +10,7 @@
 #include <linux/ioprio.h>
 /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
 #include <linux/blk_types.h>
+#include <linux/blk-filter.h>
 
 #define BIO_DEBUG
 
@@ -411,7 +412,8 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
 }
 
-extern blk_qc_t submit_bio(struct bio *);
+extern blk_qc_t submit_bio_direct(struct bio *bio);
+extern void submit_bio(struct bio *bio);
 
 extern void bio_endio(struct bio *);
 
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..f3e79e5b4586
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * API declarations for kernel modules utilizing block device filters
+ */
+
+#ifndef BLK_FILTER_H
+#define BLK_FILTER_H
+
+#ifdef CONFIG_BLK_FILTER
+#include <linux/kref.h>
+
+struct blk_filter_ops {
+	/*
+	 * Intercept bio callback.
+	 *
+	 * Returns true if the request was intercepted and placed in the
+	 * queue for processing. Otherwise submit_bio_direct() calling
+	 * needed.
+	 */
+	bool (*filter_bio)(struct bio *bio, void *filter_data);
+
+	/*
+	 * Callback to a request to add block device to the filter.
+	 *
+	 * Returns true if the block device will be filtered.
+	 * p_filter_data gets a pointer to data that is unique to
+	 * this device.
+	 */
+	bool (*part_add)(dev_t devt, void **p_filter_data);
+
+	/*
+	 * Callback to remove block device from the filter.
+	 */
+	void (*part_del)(void *filter_data);
+};
+
+struct blk_filter {
+	struct list_head link;
+	struct kref kref;
+	struct blk_filter_ops *ops;
+};
+
+/*
+ * Register/unregister device to filter
+ */
+void *blk_filter_register(struct blk_filter_ops *ops);
+
+void blk_filter_unregister(void *filter);
+
+/*
+ * Attach/detach device to filter
+ */
+int blk_filter_attach(dev_t devt, void *filter, void *filter_data);
+
+void blk_filter_detach(dev_t devt);
+
+/*
+ * For a consistent state of the file system use the freeze_bdev/thaw_bdav.
+ * But in addition, to ensure that the filter is not in the state of
+ * intercepting the next BIO, you need to call black_filter_freeze/blk_filter_thaw.
+ * This is especially actual if there is no file system on the disk.
+ */
+
+void blk_filter_freeze(struct block_device *bdev);
+
+void blk_filter_thaw(struct block_device *bdev);
+
+/*
+ * Filters intercept function
+ */
+void blk_filter_submit_bio(struct bio *bio);
+
+#endif /* CONFIG_BLK_FILTER */
+
+#endif
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4ab853461dff..514fab6b947e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -4,7 +4,7 @@
 
 /*
  * 	genhd.h Copyright (C) 1992 Drew Eckhardt
- *	Generic hard disk header file by  
+ *	Generic hard disk header file by
  * 		Drew Eckhardt
  *
  *		<drew@colorado.edu>
@@ -75,6 +75,12 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	struct rcu_work rcu_work;
+
+#ifdef CONFIG_BLK_FILTER
+	struct rw_semaphore filter_rw_lockup; /* for freezing block device*/
+	struct blk_filter *filter; /* block layer filter*/
+	void *filter_data; /*specific for each block device filters data*/
+#endif
 };
 
 /**
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 01e2858b5fe3..5287346b87a1 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -283,7 +283,7 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
 		bio->bi_end_io = hib_end_io;
 		bio->bi_private = hb;
 		atomic_inc(&hb->count);
-		submit_bio(bio);
+		submit_bio_direct(bio);
 	} else {
 		error = submit_bio_wait(bio);
 		bio_put(bio);
diff --git a/mm/page_io.c b/mm/page_io.c
index e485a6e8a6cd..4540426400b3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -362,7 +362,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	count_swpout_vm_event(page);
 	set_page_writeback(page);
 	unlock_page(page);
-	submit_bio(bio);
+	submit_bio_direct(bio);
 out:
 	return ret;
 }
@@ -434,7 +434,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}
 	count_vm_event(PSWPIN);
 	bio_get(bio);
-	qc = submit_bio(bio);
+	qc = submit_bio_direct(bio);
 	while (synchronous) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(bio->bi_private))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices
       [not found] ` <1603271049-20681-3-git-send-email-sergei.shtepa@veeam.com>
@ 2020-10-21  9:08   ` Pavel Machek
  2020-10-21  9:37     ` Sergei Shtepa
       [not found]   ` <BL0PR04MB65142D9F391FE8777F096EF5E71C0@BL0PR04MB6514.namprd04.prod.outlook.com>
  2020-10-21 15:11   ` Randy Dunlap
  2 siblings, 1 reply; 29+ messages in thread
From: Pavel Machek @ 2020-10-21  9:08 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, akpm, johannes.thumshirn, ming.lei, jack, tj, gustavo,
	bvanassche, osandov, koct9i, damien.lemoal, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

[-- Attachment #1: Type: text/plain, Size: 948 bytes --]


We'll need some changelog here.

> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>


> --- /dev/null
> +++ b/drivers/block/blk-snap/blk-snap-ctl.h
> @@ -0,0 +1,190 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#pragma once

once?

> +#define MODULE_NAME "blk-snap"
> +#define SNAP_IMAGE_NAME "blk-snap-image"
> +
> +#define SUCCESS 0
> +
> +#define MAX_TRACKING_DEVICE_COUNT 256
> +
> +#define BLK_SNAP 'V'
> +
> +#pragma pack(push, 1)
> +//////////////////////////////////////////////////////////////////////////
> +// version

This is not normal comment style.

> +#define BLK_SNAP_COMPATIBILITY_SNAPSTORE 0x0000000000000001ull /* rudiment */
> +//#define BLK_SNAP_COMPATIBILITY_BTRFS	 0x0000000000000002ull /* rudiment */
> +#define BLK_SNAP_COMPATIBILITY_MULTIDEV 0x0000000000000004ull

Don't comment out code.

Best regards,
								Pavel

-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
@ 2020-10-21  9:14   ` Johannes Thumshirn
  2020-10-21 10:01     ` Sergei Shtepa
  2020-10-21  9:21   ` Damien Le Moal
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Johannes Thumshirn @ 2020-10-21  9:14 UTC (permalink / raw)
  To: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, Damien Le Moal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On 21/10/2020 11:04, Sergei Shtepa wrote:
> +	help
> +	  Enabling this lets third-party kernel modules intercept
> +	  bio requests for any block device. This allows them to implement

The "third-party kernel modules" part sounds a bit worrisome to me. Especially
as this functionality is based on EXPORT_SYMBOL()s without the GPL suffix.

I read it as a "allow a proprietary module to mess with bios", which is a big 
no-no to me.

Not providing any sort of changelog doesn't help much either.

Thanks,
	Johannes

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
  2020-10-21  9:14   ` Johannes Thumshirn
@ 2020-10-21  9:21   ` Damien Le Moal
  2020-10-21 10:27     ` Sergei Shtepa
  2020-10-21 11:44     ` Matthew Wilcox
  2020-10-21 15:09   ` Randy Dunlap
  2020-10-24 14:53   ` Greg KH
  3 siblings, 2 replies; 29+ messages in thread
From: Damien Le Moal @ 2020-10-21  9:21 UTC (permalink / raw)
  To: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On 2020/10/21 18:04, Sergei Shtepa wrote:
> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
> ---
>  block/Kconfig               |  11 ++
>  block/Makefile              |   1 +
>  block/blk-core.c            |  52 +++++--
>  block/blk-filter-internal.h |  29 ++++
>  block/blk-filter.c          | 286 ++++++++++++++++++++++++++++++++++++
>  block/partitions/core.c     |  14 +-
>  fs/block_dev.c              |   6 +-
>  fs/direct-io.c              |   2 +-
>  fs/iomap/direct-io.c        |   2 +-
>  include/linux/bio.h         |   4 +-
>  include/linux/blk-filter.h  |  76 ++++++++++
>  include/linux/genhd.h       |   8 +-
>  kernel/power/swap.c         |   2 +-
>  mm/page_io.c                |   4 +-
>  14 files changed, 471 insertions(+), 26 deletions(-)
>  create mode 100644 block/blk-filter-internal.h
>  create mode 100644 block/blk-filter.c
>  create mode 100644 include/linux/blk-filter.h
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index bbad5e8bbffe..a308801b4376 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -204,6 +204,17 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
>  	  by falling back to the kernel crypto API when inline
>  	  encryption hardware is not present.
>  
> +config BLK_FILTER
> +	bool "Enable support for block layer filters"
> +	default y
> +	depends on MODULES
> +	help
> +	  Enabling this lets third-party kernel modules intercept
> +	  bio requests for any block device. This allows them to implement
> +	  changed block tracking and snapshots without any reconfiguration of
> +	  the existing setup. For example, this option allows snapshotting of
> +	  a block device without adding it to LVM.
> +
>  menu "Partition Types"
>  
>  source "block/partitions/Kconfig"
> diff --git a/block/Makefile b/block/Makefile
> index 8d841f5f986f..b8ee50b8e031 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -38,3 +38,4 @@ obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
>  obj-$(CONFIG_BLK_PM)		+= blk-pm.o
>  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= keyslot-manager.o blk-crypto.o
>  obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK)	+= blk-crypto-fallback.o
> +obj-$(CONFIG_BLK_FILTER)	+= blk-filter.o
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 10c08ac50697..cc06402af695 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1216,23 +1216,20 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
>  EXPORT_SYMBOL(submit_bio_noacct);

Probably best to have this as its own patch last in the series.

>  
>  /**
> - * submit_bio - submit a bio to the block device layer for I/O
> - * @bio: The &struct bio which describes the I/O
> - *
> - * submit_bio() is used to submit I/O requests to block devices.  It is passed a
> - * fully set up &struct bio that describes the I/O that needs to be done.  The
> - * bio will be send to the device described by the bi_disk and bi_partno fields.
> + * submit_bio_direct - submit a bio to the block device layer for I/O
> + * bypass filter.
> + * @bio:  The bio describing the location in memory and on the device.
>   *
> - * The success/failure status of the request, along with notification of
> - * completion, is delivered asynchronously through the ->bi_end_io() callback
> - * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
> - * been called.
> + * Description:
> + *    This is a version of submit_bio() that shall only be used for I/O
> + *    that cannot be intercepted by block layer filters.
> + *    All file systems and other upper level users of the block layer
> + *    should use submit_bio() instead.
> + *    Use this function to access the swap partition and directly access
> + *    the block device file.
>   */
> -blk_qc_t submit_bio(struct bio *bio)
> +blk_qc_t submit_bio_direct(struct bio *bio)
>  {
> -	if (blkcg_punt_bio_submit(bio))
> -		return BLK_QC_T_NONE;
> -
>  	/*
>  	 * If it's a regular read/write or a barrier with data attached,
>  	 * go through the normal accounting stuff before submission.
> @@ -1282,8 +1279,35 @@ blk_qc_t submit_bio(struct bio *bio)
>  
>  	return submit_bio_noacct(bio);
>  }
> +EXPORT_SYMBOL(submit_bio_direct);

EXPORT_SYMBOL_GPL

> +
> +/**
> + * submit_bio - submit a bio to the block device layer for I/O
> + * @bio: The &struct bio which describes the I/O
> + *
> + * submit_bio() is used to submit I/O requests to block devices.  It is passed a
> + * fully set up &struct bio that describes the I/O that needs to be done.  The
> + * bio will be send to the device described by the bi_disk and bi_partno fields.
> + *
> + * The success/failure status of the request, along with notification of
> + * completion, is delivered asynchronously through the ->bi_end_io() callback
> + * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
> + * been called.
> + */
> +void submit_bio(struct bio *bio)
> +{
> +	if (blkcg_punt_bio_submit(bio))
> +		return;
> +
> +#ifdef CONFIG_BLK_FILTER> +	blk_filter_submit_bio(bio);
> +#else
> +	submit_bio_direct(bio);
> +#endif


	if (IS_ENABLED(CONFIG_BLK_FILTER))
		blk_filter_submit_bio(bio);
	else
		submit_bio_direct(bio);

is much cleaner...


> +}
>  EXPORT_SYMBOL(submit_bio);
>  
> +
>  /**
>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>   *                              for the new queue limits


The remaining should probably be a different patch before the above change.

> diff --git a/block/blk-filter-internal.h b/block/blk-filter-internal.h
> new file mode 100644
> index 000000000000..d456a09f50db
> --- /dev/null
> +++ b/block/blk-filter-internal.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + *
> + * Block device filters internal declarations
> + */
> +
> +#ifndef BLK_FILTER_INTERNAL_H
> +#define BLK_FILTER_INTERNAL_H
> +
> +#ifdef CONFIG_BLK_FILTER
> +#include <linux/blk-filter.h>
> +
> +void blk_filter_part_add(struct hd_struct *part, dev_t devt);
> +
> +void blk_filter_part_del(struct hd_struct *part);
> +
> +#else /* CONFIG_BLK_FILTER */
> +
> +

double blank line

> +static inline void blk_filter_part_add(struct hd_struct *part, dev_t devt)
> +{ };
> +
> +static inline void blk_filter_part_del(struct hd_struct *part)
> +{ };
> +
> +#endif /* CONFIG_BLK_FILTER */
> +
> +#endif
> diff --git a/block/blk-filter.c b/block/blk-filter.c
> new file mode 100644
> index 000000000000..f6de16c45a16
> --- /dev/null
> +++ b/block/blk-filter.c
> @@ -0,0 +1,286 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/genhd.h>
> +#include <linux/bio.h>
> +#include <linux/blkdev.h>
> +#include "blk-filter-internal.h"
> +#include <linux/rwsem.h>
> +
> +

Again here

> +LIST_HEAD(filters);
> +DECLARE_RWSEM(filters_lock);
> +
> +static void blk_filter_release(struct kref *kref)
> +{
> +	struct blk_filter *flt = container_of(kref, struct blk_filter, kref);
> +
> +	kfree(flt);
> +}
> +
> +static inline void blk_filter_get(struct blk_filter *flt)
> +{
> +	kref_get(&flt->kref);
> +}
> +
> +static inline void blk_filter_put(struct blk_filter *flt)
> +{
> +	kref_put(&flt->kref, blk_filter_release);
> +}
> +
> +
> +/**
> + * blk_filter_part_add() - Notify filters when a new partition is added.
> + * @part: The partition for new block device.
> + * @devt: Device id for new block device.
> + *
> + * Description:
> + *    When the block device is appears in the system, call the filter
> + *    callback to notify that the block device appears.
> + */
> +void blk_filter_part_add(struct hd_struct *part, dev_t devt)
> +{
> +	down_read(&filters_lock);
> +	if (!list_empty(&filters)) {
> +		struct list_head *_list_head;
> +
> +		list_for_each(_list_head, &filters) {
> +			void *filter_data;
> +			bool attached = false;
> +			struct blk_filter *flt;
> +
> +			flt = list_entry(_list_head, struct blk_filter, link);
> +
> +			attached = flt->ops->part_add(devt, &filter_data);
> +			if (attached) {
> +				blk_filter_get(flt);
> +				part->filter = flt;
> +				part->filter_data = filter_data;
> +				break;
> +			}
> +		}
> +	}
> +	up_read(&filters_lock);
> +}
> +
> +/**
> + * blk_filter_part_del() - Notify filters when the partition is deleted.
> + * @part: The partition of block device.
> + *
> + * Description:
> + *    When the block device is destroying and the partition is releasing,
> + *    call the filter callback to notify that the block device will be
> + *    deleted.
> + */
> +void blk_filter_part_del(struct hd_struct *part)
> +{
> +	struct blk_filter *flt = part->filter;
> +
> +	if (!flt)
> +		return;
> +
> +	flt->ops->part_del(part->filter_data);
> +
> +	part->filter_data = NULL;
> +	part->filter = NULL;
> +	blk_filter_put(flt);
> +}
> +
> +
> +/**
> + * blk_filter_submit_bio() - Send new bio to filters for processing.
> + * @bio: The new bio for block I/O layer.
> + *
> + * Description:
> + *    This function is an implementation of block layer filter
> + *    interception. If the filter is attached to this block device,
> + *    then bio will be redirected to the filter kernel module.
> + */
> +void blk_filter_submit_bio(struct bio *bio)
> +{
> +	bool intercepted = false;
> +	struct hd_struct *part;
> +
> +	bio_get(bio);
> +
> +	part = disk_get_part(bio->bi_disk, bio->bi_partno);
> +	if (unlikely(!part)) {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +
> +		bio_put(bio);
> +		return;
> +	}
> +
> +	down_read(&part->filter_rw_lockup);
> +
> +	if (part->filter)
> +		intercepted = part->filter->ops->filter_bio(bio, part->filter_data);
> +
> +	up_read(&part->filter_rw_lockup);
> +
> +	if (!intercepted)
> +		submit_bio_direct(bio);
> +
> +	disk_put_part(part);
> +
> +	bio_put(bio);
> +}
> +EXPORT_SYMBOL(blk_filter_submit_bio);
> +
> +/**
> + * blk_filter_register() - Register block layer filter.
> + * @ops: New filter callbacks.
> + *
> + * Return:
> + *     Filter ID, a pointer to the service structure of the filter.
> + *
> + * Description:
> + *    Create new filter structure.
> + *    Use blk_filter_attach to attach devices to filter.
> + */
> +void *blk_filter_register(struct blk_filter_ops *ops)
> +{
> +	struct blk_filter *flt;
> +
> +	flt = kzalloc(sizeof(struct blk_filter), GFP_KERNEL);
> +	if (!flt)
> +		return NULL;
> +
> +	kref_init(&flt->kref);
> +	flt->ops = ops;
> +
> +	down_write(&filters_lock);
> +	list_add_tail(&flt->link, &filters);
> +	up_write(&filters_lock);
> +
> +	return flt;
> +}
> +EXPORT_SYMBOL(blk_filter_register);
> +
> +/**
> + * blk_filter_unregister() - Unregister block layer filter.
> + * @filter: filter identifier.
> + *
> + * Description:
> + *    Before call blk_filter_unregister() and unload filter module all
> + *    partitions MUST be detached. Otherwise, the system will have a
> + *    filter with non-existent interception functions.
> + */
> +void blk_filter_unregister(void *filter)
> +{
> +	struct blk_filter *flt = filter;
> +
> +	down_write(&filters_lock);
> +	list_del(&flt->link);
> +	up_write(&filters_lock);
> +
> +	blk_filter_put(flt);
> +}
> +EXPORT_SYMBOL(blk_filter_unregister);
> +
> +/**
> + * blk_filter_attach() - Attach block layer filter.
> + * @devt: The block device identification number.
> + * @filter: Filter identifier.
> + * @filter_data: Specific filters data for this device.
> + *
> + * Return:
> + *    Return code.
> + *    -ENODEV - cannot find this device, it is OK if the device does not exist yet.
> + *    -EALREADY - this device is already attached to this filter.
> + *    -EBUSY - this device is already attached to the another filter.
> + *
> + * Description:
> + *    Attach the device to the block layer filter.
> + *    Only one filter can be attached to a single device.
> + */
> +int blk_filter_attach(dev_t devt, void *filter, void *filter_data)
> +{
> +	int ret = 0;
> +	struct blk_filter *flt = filter;
> +	struct block_device *blk_dev;
> +
> +
> +	blk_dev = bdget(devt);
> +	if (!blk_dev)
> +		return -ENODEV;
> +
> +	blk_filter_freeze(blk_dev);
> +
> +	if (blk_dev->bd_part->filter) {
> +		if (blk_dev->bd_part->filter == flt)
> +			ret = -EALREADY;
> +		else
> +			ret = -EBUSY;
> +	} else {
> +		blk_filter_get(flt);
> +		blk_dev->bd_part->filter = flt;
> +		blk_dev->bd_part->filter_data = filter_data;
> +	}
> +
> +	blk_filter_thaw(blk_dev);
> +
> +	bdput(blk_dev);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(blk_filter_attach);
> +
> +/**
> + * blk_filter_detach() - Detach block layer filter.
> + * @devt: The block device identification number.
> + *
> + * Description:
> + *    Detach the device from the block layer filter.
> + *    Do not forget detach all devices before calling the
> + *    blk_filter_unregister() function and unload the module!
> + */
> +void blk_filter_detach(dev_t devt)
> +{
> +	struct blk_filter *flt;
> +	struct block_device *blk_dev;
> +
> +	blk_dev = bdget(devt);
> +	if (!blk_dev)
> +		return;
> +
> +	blk_filter_freeze(blk_dev);
> +
> +	flt = blk_dev->bd_part->filter;
> +	if (flt) {
> +		blk_dev->bd_part->filter_data = NULL;
> +		blk_dev->bd_part->filter = NULL;
> +		blk_filter_put(flt);
> +	}
> +
> +	blk_filter_thaw(blk_dev);
> +
> +	bdput(blk_dev);
> +}
> +EXPORT_SYMBOL(blk_filter_detach);

All the EXPORT_SYMBOL should probably be EXPORT_SYMBOL_GPL.

> +
> +/**
> + * blk_filter_freeze() - Lock bio submitting.
> + * @bdev: The block device pointer.
> + *
> + * Description:
> + *    Stop bio processing.
> + */
> +void blk_filter_freeze(struct block_device *bdev)
> +{
> +	down_write(&bdev->bd_part->filter_rw_lockup);
> +}
> +EXPORT_SYMBOL(blk_filter_freeze);
> +
> +/**
> + * blk_filter_thaw() - Unlock bio submitting.
> + * @bdev: The block device pointer.
> + *
> + * Description:
> + *    Resume bio processing.
> + */
> +void blk_filter_thaw(struct block_device *bdev)
> +{
> +	up_write(&bdev->bd_part->filter_rw_lockup);
> +}
> +EXPORT_SYMBOL(blk_filter_thaw);
> diff --git a/block/partitions/core.c b/block/partitions/core.c
> index 722406b841df..6b845e98b9a1 100644
> --- a/block/partitions/core.c
> +++ b/block/partitions/core.c
> @@ -11,6 +11,7 @@
>  #include <linux/blktrace_api.h>
>  #include <linux/raid/detect.h>
>  #include "check.h"
> +#include "../blk-filter-internal.h"
>  
>  static int (*check_part[])(struct parsed_partitions *) = {
>  	/*
> @@ -320,9 +321,11 @@ int hd_ref_init(struct hd_struct *part)
>   */
>  void delete_partition(struct gendisk *disk, struct hd_struct *part)
>  {
> -	struct disk_part_tbl *ptbl =
> -		rcu_dereference_protected(disk->part_tbl, 1);
> +	struct disk_part_tbl *ptbl;
> +
> +	blk_filter_part_del(part);
>  
> +	ptbl = rcu_dereference_protected(disk->part_tbl, 1);
>  	/*
>  	 * ->part_tbl is referenced in this part's release handler, so
>  	 *  we have to hold the disk device
> @@ -412,6 +415,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	p->nr_sects = len;
>  	p->partno = partno;
>  	p->policy = get_disk_ro(disk);
> +#ifdef CONFIG_BLK_FILTER
> +	init_rwsem(&p->filter_rw_lockup);
> +#endif
>  
>  	if (info) {
>  		struct partition_meta_info *pinfo;
> @@ -469,6 +475,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	/* everything is up and running, commence */
>  	rcu_assign_pointer(ptbl->part[partno], p);
>  
> +	/*inform filter about a new partition*/
> +	blk_filter_part_add(p, devt);
> +
>  	/* suppress uevent if the disk suppresses it */
>  	if (!dev_get_uevent_suppress(ddev))
>  		kobject_uevent(&pdev->kobj, KOBJ_ADD);
> @@ -552,6 +561,7 @@ int bdev_del_partition(struct block_device *bdev, int partno)
>  		goto out_unlock;
>  
>  	sync_blockdev(bdevp);
> +
>  	invalidate_bdev(bdevp);
>  
>  	delete_partition(bdev->bd_disk, part);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 8ae833e00443..431eae17fd8f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,7 +237,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_HIPRI)
>  		bio_set_polled(&bio, iocb);
>  
> -	qc = submit_bio(&bio);
> +	qc = submit_bio_direct(&bio);
>  	for (;;) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		if (!READ_ONCE(bio.bi_private))
> @@ -400,7 +400,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  				polled = true;
>  			}
>  
> -			qc = submit_bio(bio);
> +			qc = submit_bio_direct(bio);
>  
>  			if (polled)
>  				WRITE_ONCE(iocb->ki_cookie, qc);
> @@ -421,7 +421,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  			atomic_inc(&dio->ref);
>  		}
>  
> -		submit_bio(bio);
> +		submit_bio_direct(bio);
>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>  	}
>  
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 183299892465..d9bb1b6f6814 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -459,7 +459,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>  		sdio->submit_io(bio, dio->inode, sdio->logical_offset_in_bio);
>  		dio->bio_cookie = BLK_QC_T_NONE;
>  	} else
> -		dio->bio_cookie = submit_bio(bio);
> +		dio->bio_cookie = submit_bio_direct(bio);

All these changes are unnecessary if you reverse things: submit_bio() is kept as
the direct version (as today) and you use a "submit_bio_filtered()" where needed.

>  
>  	sdio->bio = NULL;
>  	sdio->boundary = 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..e05f20ce8b5f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -73,7 +73,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  				file_inode(dio->iocb->ki_filp),
>  				iomap, bio, pos);
>  	else
> -		dio->submit.cookie = submit_bio(bio);
> +		dio->submit.cookie = submit_bio_direct(bio);
>  }
>  
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c6d765382926..5b0a32697207 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioprio.h>
>  /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
>  #include <linux/blk_types.h>
> +#include <linux/blk-filter.h>
>  
>  #define BIO_DEBUG
>  
> @@ -411,7 +412,8 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
>  	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
>  }
>  
> -extern blk_qc_t submit_bio(struct bio *);
> +extern blk_qc_t submit_bio_direct(struct bio *bio);
> +extern void submit_bio(struct bio *bio);
>  
>  extern void bio_endio(struct bio *);
>  
> diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
> new file mode 100644
> index 000000000000..f3e79e5b4586
> --- /dev/null
> +++ b/include/linux/blk-filter.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * API declarations for kernel modules utilizing block device filters
> + */
> +
> +#ifndef BLK_FILTER_H
> +#define BLK_FILTER_H
> +
> +#ifdef CONFIG_BLK_FILTER
> +#include <linux/kref.h>
> +
> +struct blk_filter_ops {
> +	/*
> +	 * Intercept bio callback.
> +	 *
> +	 * Returns true if the request was intercepted and placed in the
> +	 * queue for processing. Otherwise submit_bio_direct() calling
> +	 * needed.
> +	 */
> +	bool (*filter_bio)(struct bio *bio, void *filter_data);
> +
> +	/*
> +	 * Callback to a request to add block device to the filter.
> +	 *
> +	 * Returns true if the block device will be filtered.
> +	 * p_filter_data gets a pointer to data that is unique to
> +	 * this device.
> +	 */
> +	bool (*part_add)(dev_t devt, void **p_filter_data);
> +
> +	/*
> +	 * Callback to remove block device from the filter.
> +	 */
> +	void (*part_del)(void *filter_data);
> +};
> +
> +struct blk_filter {
> +	struct list_head link;
> +	struct kref kref;
> +	struct blk_filter_ops *ops;
> +};
> +
> +/*
> + * Register/unregister device to filter
> + */
> +void *blk_filter_register(struct blk_filter_ops *ops);
> +
> +void blk_filter_unregister(void *filter);
> +
> +/*
> + * Attach/detach device to filter
> + */
> +int blk_filter_attach(dev_t devt, void *filter, void *filter_data);
> +
> +void blk_filter_detach(dev_t devt);
> +
> +/*
> + * For a consistent state of the file system use the freeze_bdev/thaw_bdav.
> + * But in addition, to ensure that the filter is not in the state of
> + * intercepting the next BIO, you need to call black_filter_freeze/blk_filter_thaw.
> + * This is especially actual if there is no file system on the disk.
> + */
> +
> +void blk_filter_freeze(struct block_device *bdev);
> +
> +void blk_filter_thaw(struct block_device *bdev);
> +
> +/*
> + * Filters intercept function
> + */
> +void blk_filter_submit_bio(struct bio *bio);
> +
> +#endif /* CONFIG_BLK_FILTER */
> +
> +#endif
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 4ab853461dff..514fab6b947e 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -4,7 +4,7 @@
>  
>  /*
>   * 	genhd.h Copyright (C) 1992 Drew Eckhardt
> - *	Generic hard disk header file by  
> + *	Generic hard disk header file by
>   * 		Drew Eckhardt
>   *
>   *		<drew@colorado.edu>
> @@ -75,6 +75,12 @@ struct hd_struct {
>  	int make_it_fail;
>  #endif
>  	struct rcu_work rcu_work;
> +
> +#ifdef CONFIG_BLK_FILTER
> +	struct rw_semaphore filter_rw_lockup; /* for freezing block device*/
> +	struct blk_filter *filter; /* block layer filter*/
> +	void *filter_data; /*specific for each block device filters data*/
> +#endif
>  };
>  
>  /**
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 01e2858b5fe3..5287346b87a1 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -283,7 +283,7 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
>  		bio->bi_end_io = hib_end_io;
>  		bio->bi_private = hb;
>  		atomic_inc(&hb->count);
> -		submit_bio(bio);
> +		submit_bio_direct(bio);
>  	} else {
>  		error = submit_bio_wait(bio);
>  		bio_put(bio);
> diff --git a/mm/page_io.c b/mm/page_io.c
> index e485a6e8a6cd..4540426400b3 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -362,7 +362,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  	count_swpout_vm_event(page);
>  	set_page_writeback(page);
>  	unlock_page(page);
> -	submit_bio(bio);
> +	submit_bio_direct(bio);
>  out:
>  	return ret;
>  }
> @@ -434,7 +434,7 @@ int swap_readpage(struct page *page, bool synchronous)
>  	}
>  	count_vm_event(PSWPIN);
>  	bio_get(bio);
> -	qc = submit_bio(bio);
> +	qc = submit_bio_direct(bio);
>  	while (synchronous) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		if (!READ_ONCE(bio->bi_private))
> 

Separate into multiple patches: one that introduces the filter functions/ops
code and another that changes the block layer where needed.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices
  2020-10-21  9:08   ` [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices Pavel Machek
@ 2020-10-21  9:37     ` Sergei Shtepa
  0 siblings, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21  9:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, akpm, johannes.thumshirn, ming.lei, jack, tj, gustavo,
	bvanassche, osandov, koct9i, damien.lemoal, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

#pragma once still banned? I think need to add a check for this ./scripts/checkpatch.pl.
Comment code style - ok, thank you.
-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:14   ` Johannes Thumshirn
@ 2020-10-21 10:01     ` Sergei Shtepa
  0 siblings, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 10:01 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, ming.lei, jack, tj, gustavo, bvanassche,
	osandov, koct9i, Damien Le Moal, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

The 10/21/2020 12:14, Johannes Thumshirn wrote:
> On 21/10/2020 11:04, Sergei Shtepa wrote:
> > +	help
> > +	  Enabling this lets third-party kernel modules intercept
> > +	  bio requests for any block device. This allows them to implement
> 
> The "third-party kernel modules" part sounds a bit worrisome to me. Especially
> as this functionality is based on EXPORT_SYMBOL()s without the GPL suffix.
> 
> I read it as a "allow a proprietary module to mess with bios", which is a big 
> no-no to me.
> 
> Not providing any sort of changelog doesn't help much either.
> 
> Thanks,
> 	Johannes
> 

I think the words "third-party" are is not necessary.
In my opinion, creating proprietary kernel modules for Linux is an empty idea.

EXPORT_SYMBOL() -> EXPORT_SYMBOL_GPL() - no problem.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:21   ` Damien Le Moal
@ 2020-10-21 10:27     ` Sergei Shtepa
  2020-10-21 11:44     ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 10:27 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, Johannes Thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

EXPORT_SYMBOL_GPL - ok.

#ifdef CONFIG_BLK_FILTER or IS_ENABLED() - It's a matter of habit.

> double blank line
Ok, I did.
Looks like a candidate for ./scripts/checkpatch.pl.

> Separate into multiple patches: one that introduces the filter
> functions/ops code and another that changes the block layer where needed.

I'll think about it. Personally, it seems to me that this separation
does not make it easier to understand the code. 
It is important for me to know immediately where the function is called,
and this determines its behavior.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices
       [not found]   ` <BL0PR04MB65142D9F391FE8777F096EF5E71C0@BL0PR04MB6514.namprd04.prod.outlook.com>
@ 2020-10-21 11:15     ` Sergei Shtepa
  0 siblings, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 11:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, Johannes Thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

> And this is a 8600+ lines patch.
> Can you split this into manageable pieces ?

Yes, the module was quite large. But I think it's not good to show
the elephant in parts. 
https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

> I do not think anybody will review such a huge patch.

Yes, it will be a lot of work. But I hope that the code architecture
and splitting entities into separate files will help.

If someone can advise how to divide a module into a chain of patches,
I will be happy. I do not dare to divide it without losing meaning.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:21   ` Damien Le Moal
  2020-10-21 10:27     ` Sergei Shtepa
@ 2020-10-21 11:44     ` Matthew Wilcox
  2020-10-21 12:55       ` Sergei Shtepa
  1 sibling, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2020-10-21 11:44 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On Wed, Oct 21, 2020 at 09:21:36AM +0000, Damien Le Moal wrote:
> > + * submit_bio_direct - submit a bio to the block device layer for I/O
> > + * bypass filter.
> > + * @bio:  The bio describing the location in memory and on the device.
> >   *
> > + * Description:

You don't need this line.

> > + *    This is a version of submit_bio() that shall only be used for I/O
> > + *    that cannot be intercepted by block layer filters.
> > + *    All file systems and other upper level users of the block layer
> > + *    should use submit_bio() instead.
> > + *    Use this function to access the swap partition and directly access
> > + *    the block device file.

I don't understand why O_DIRECT gets to bypass the block filter.  Nor do
I understand why anybody would place a block filter on the swap device.
But if somebody did place a filter on the swap device, why should swap
be able to bypass the filter?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21 11:44     ` Matthew Wilcox
@ 2020-10-21 12:55       ` Sergei Shtepa
  2020-10-21 13:07         ` Matthew Wilcox
  0 siblings, 1 reply; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 12:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Damien Le Moal, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

The 10/21/2020 14:44, Matthew Wilcox wrote:
> On Wed, Oct 21, 2020 at 09:21:36AM +0000, Damien Le Moal wrote:
> > > + * submit_bio_direct - submit a bio to the block device layer for I/O
> > > + * bypass filter.
> > > + * @bio:  The bio describing the location in memory and on the device.
> > >   *
> > > + * Description:
> 
> You don't need this line.
> 
> > > + *    This is a version of submit_bio() that shall only be used for I/O
> > > + *    that cannot be intercepted by block layer filters.
> > > + *    All file systems and other upper level users of the block layer
> > > + *    should use submit_bio() instead.
> > > + *    Use this function to access the swap partition and directly access
> > > + *    the block device file.
> 
> I don't understand why O_DIRECT gets to bypass the block filter.  Nor do
> I understand why anybody would place a block filter on the swap device.
> But if somebody did place a filter on the swap device, why should swap
> be able to bypass the filter?
> 

I am very happy to hear such a question. You are really trying to
understand the algorithm.

Yes, intercepting the swap partition is absurd. But we can't guarantee
that the filter won't intercept swap.

Swap operation is related to the memory allocation logic. If a swap on
the block device are accessed during memory allocation from filter,
a deadlock occurs. We can allow filters to occasionally shoot off their
feet, especially under high load. But I think it's better not to do it.

"directly access" - it is not O_DIRECT. This means (I think) direct
reading from the device file, like "dd if=/dev/sda1".
As for intercepting direct reading, I don't know how to do the right thing.

The problem here is that in fs/block_dev.c in function __blkdev_direct_IO()
uses the qc - value returned by the submit_bio() function.
This value is used below when calling 
blk_poll(bdev_get_queue(dev), qc, true).
The filter cannot return a meaningful value of the blk_qc_t type when
intercepting a request, because at that time it does not know which queue
the request will fall into.

If function submit_bio() will always return BLK_QC_T_NONE - I think the
algorithm of the __blk dev_direct_IO() will not work correctly.
If we need to intercept direct access to a block device, we need to at
least redo the __blkdev_direct_IO function, getting rid of blk_pool.
I'm not sure it's necessary yet.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21 12:55       ` Sergei Shtepa
@ 2020-10-21 13:07         ` Matthew Wilcox
  2020-10-21 14:35           ` Sergei Shtepa
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2020-10-21 13:07 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Damien Le Moal, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On Wed, Oct 21, 2020 at 03:55:55PM +0300, Sergei Shtepa wrote:
> The 10/21/2020 14:44, Matthew Wilcox wrote:
> > I don't understand why O_DIRECT gets to bypass the block filter.  Nor do
> > I understand why anybody would place a block filter on the swap device.
> > But if somebody did place a filter on the swap device, why should swap
> > be able to bypass the filter?
> 
> Yes, intercepting the swap partition is absurd. But we can't guarantee
> that the filter won't intercept swap.
> 
> Swap operation is related to the memory allocation logic. If a swap on
> the block device are accessed during memory allocation from filter,
> a deadlock occurs. We can allow filters to occasionally shoot off their
> feet, especially under high load. But I think it's better not to do it.

We already have logic to prevent this in Linux.  Filters need to
call memalloc_noio_save() while they might cause swap to happen and
memalloc_noio_restore() once it's safe for them to cause swap again.

> "directly access" - it is not O_DIRECT. This means (I think) direct
> reading from the device file, like "dd if=/dev/sda1".
> As for intercepting direct reading, I don't know how to do the right thing.
> 
> The problem here is that in fs/block_dev.c in function __blkdev_direct_IO()
> uses the qc - value returned by the submit_bio() function.
> This value is used below when calling 
> blk_poll(bdev_get_queue(dev), qc, true).
> The filter cannot return a meaningful value of the blk_qc_t type when
> intercepting a request, because at that time it does not know which queue
> the request will fall into.
> 
> If function submit_bio() will always return BLK_QC_T_NONE - I think the
> algorithm of the __blk dev_direct_IO() will not work correctly.
> If we need to intercept direct access to a block device, we need to at
> least redo the __blkdev_direct_IO function, getting rid of blk_pool.
> I'm not sure it's necessary yet.

This isn't part of the block layer that I'm familiar with, so I can't
help solve this problem, but allowing O_DIRECT to bypass the block filter
is a hole that needs to be fixed before these patches can be considered.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-21  9:04 [PATCH 0/2] block layer filter and block device snapshot module Sergei Shtepa
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
       [not found] ` <1603271049-20681-3-git-send-email-sergei.shtepa@veeam.com>
@ 2020-10-21 13:31 ` Hannes Reinecke
  2020-10-21 14:10   ` Sergei Shtepa
  2020-10-22 18:35 ` Mike Snitzer
  3 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2020-10-21 13:31 UTC (permalink / raw)
  To: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, johannes.thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i,
	damien.lemoal, steve, linux-block, linux-kernel, linux-pm,
	linux-mm

On 10/21/20 11:04 AM, Sergei Shtepa wrote:
> Hello everyone! Requesting for your comments and suggestions.
> 
> # blk-filter
> 
> Block layer filter allows to intercept BIO requests to a block device.
> 
> Interception is performed at the very beginning of the BIO request
> processing, and therefore does not affect the operation of the request
> processing queue. This also makes it possible to intercept requests from
> a specific block device, rather than from the entire disk.
> 
> The logic of the submit_bio function has been changed - since the
> function execution results are not processed anywhere (except for swap
> and direct-io) the function won't return a value anymore.
> 
> Now the submit_bio_direct() function is called whenever the result of
> the blk_qc_t function is required. submit_bio_direct() is not
> intercepted by the block layer filter. This is logical for swap and
> direct-io.
> 
> Block layer filter allows you to enable and disable the filter driver on
> the fly. When a new block device is added, the filter driver can start
> filtering this device. When you delete a device, the filter can remove
> its own filter.
> 
> The idea of multiple altitudes had to be abandoned in order to simplify
> implementation and make it more reliable. Different filter drivers can
> work simultaneously, but each on its own block device.
> 
> # blk-snap
> 
> We propose a new kernel module - blk-snap. This module implements
> snapshot and changed block tracking functionality. It is intended to
> create backup copies of any block devices without usage of device mapper.
> Snapshots are temporary and are destroyed after the backup process has
> finished. Changed block tracking allows for incremental and differential
> backup copies.
> 
> blk-snap uses block layer filter. Block layer filter provides a callback
> to intercept bio-requests. If a block device disappears for whatever
> reason, send a synchronous request to remove the device from filtering.
> 
> blk-snap kernel module is a product of a deep refactoring of the
> out-of-tree kernel veeamsnap kernel module
> (https://github.com/veeam/veeamsnap/):
> * all conditional compilation branches that served for the purpose of
>    compatibility with older kernels have been removed;
> * linux kernel code style has been applied;
> * blk-snap mostly takes advantage of the existing kernel code instead of
>    reinventing the wheel;
> * all redundant code (such as persistent cbt and snapstore collector)
>    has been removed.
> 
> Several important things are still have to be done:
> * refactor the module interface for interaction with a user-space code -
>    it is already clear that the implementation of some calls can be
>    improved.
> 
> Your feedback would be greatly appreciated!
> 
> Sergei Shtepa (2):
>    Block layer filter - second version
>    blk-snap - snapshots and change-tracking for block devices
> 
>   block/Kconfig                               |  11 +
>   block/Makefile                              |   1 +
>   block/blk-core.c                            |  52 +-
>   block/blk-filter-internal.h                 |  29 +
>   block/blk-filter.c                          | 286 ++++++
>   block/partitions/core.c                     |  14 +-
>   drivers/block/Kconfig                       |   2 +
>   drivers/block/Makefile                      |   1 +
>   drivers/block/blk-snap/Kconfig              |  24 +
>   drivers/block/blk-snap/Makefile             |  28 +
>   drivers/block/blk-snap/big_buffer.c         | 193 ++++
>   drivers/block/blk-snap/big_buffer.h         |  24 +
>   drivers/block/blk-snap/blk-snap-ctl.h       | 190 ++++
>   drivers/block/blk-snap/blk_deferred.c       | 566 +++++++++++
>   drivers/block/blk-snap/blk_deferred.h       |  67 ++
>   drivers/block/blk-snap/blk_descr_file.c     |  82 ++
>   drivers/block/blk-snap/blk_descr_file.h     |  26 +
>   drivers/block/blk-snap/blk_descr_mem.c      |  66 ++
>   drivers/block/blk-snap/blk_descr_mem.h      |  14 +
>   drivers/block/blk-snap/blk_descr_multidev.c |  86 ++
>   drivers/block/blk-snap/blk_descr_multidev.h |  25 +
>   drivers/block/blk-snap/blk_descr_pool.c     | 190 ++++
>   drivers/block/blk-snap/blk_descr_pool.h     |  38 +
>   drivers/block/blk-snap/blk_redirect.c       | 507 ++++++++++
>   drivers/block/blk-snap/blk_redirect.h       |  73 ++
>   drivers/block/blk-snap/blk_util.c           |  33 +
>   drivers/block/blk-snap/blk_util.h           |  33 +
>   drivers/block/blk-snap/cbt_map.c            | 210 +++++
>   drivers/block/blk-snap/cbt_map.h            |  62 ++
>   drivers/block/blk-snap/common.h             |  31 +
>   drivers/block/blk-snap/ctrl_fops.c          | 691 ++++++++++++++
>   drivers/block/blk-snap/ctrl_fops.h          |  19 +
>   drivers/block/blk-snap/ctrl_pipe.c          | 562 +++++++++++
>   drivers/block/blk-snap/ctrl_pipe.h          |  34 +
>   drivers/block/blk-snap/ctrl_sysfs.c         |  73 ++
>   drivers/block/blk-snap/ctrl_sysfs.h         |   5 +
>   drivers/block/blk-snap/defer_io.c           | 397 ++++++++
>   drivers/block/blk-snap/defer_io.h           |  39 +
>   drivers/block/blk-snap/main.c               |  82 ++
>   drivers/block/blk-snap/params.c             |  58 ++
>   drivers/block/blk-snap/params.h             |  29 +
>   drivers/block/blk-snap/rangevector.c        |  85 ++
>   drivers/block/blk-snap/rangevector.h        |  31 +
>   drivers/block/blk-snap/snapimage.c          | 982 ++++++++++++++++++++
>   drivers/block/blk-snap/snapimage.h          |  16 +
>   drivers/block/blk-snap/snapshot.c           | 225 +++++
>   drivers/block/blk-snap/snapshot.h           |  17 +
>   drivers/block/blk-snap/snapstore.c          | 929 ++++++++++++++++++
>   drivers/block/blk-snap/snapstore.h          |  68 ++
>   drivers/block/blk-snap/snapstore_device.c   | 532 +++++++++++
>   drivers/block/blk-snap/snapstore_device.h   |  63 ++
>   drivers/block/blk-snap/snapstore_file.c     |  52 ++
>   drivers/block/blk-snap/snapstore_file.h     |  15 +
>   drivers/block/blk-snap/snapstore_mem.c      |  91 ++
>   drivers/block/blk-snap/snapstore_mem.h      |  20 +
>   drivers/block/blk-snap/snapstore_multidev.c | 118 +++
>   drivers/block/blk-snap/snapstore_multidev.h |  22 +
>   drivers/block/blk-snap/tracker.c            | 449 +++++++++
>   drivers/block/blk-snap/tracker.h            |  38 +
>   drivers/block/blk-snap/tracking.c           | 270 ++++++
>   drivers/block/blk-snap/tracking.h           |  13 +
>   drivers/block/blk-snap/version.h            |   7 +
>   fs/block_dev.c                              |   6 +-
>   fs/direct-io.c                              |   2 +-
>   fs/iomap/direct-io.c                        |   2 +-
>   include/linux/bio.h                         |   4 +-
>   include/linux/blk-filter.h                  |  76 ++
>   include/linux/genhd.h                       |   8 +-
>   kernel/power/swap.c                         |   2 +-
>   mm/page_io.c                                |   4 +-
>   70 files changed, 9074 insertions(+), 26 deletions(-)
>   create mode 100644 block/blk-filter-internal.h
>   create mode 100644 block/blk-filter.c
>   create mode 100644 drivers/block/blk-snap/Kconfig
>   create mode 100644 drivers/block/blk-snap/Makefile
>   create mode 100644 drivers/block/blk-snap/big_buffer.c
>   create mode 100644 drivers/block/blk-snap/big_buffer.h
>   create mode 100644 drivers/block/blk-snap/blk-snap-ctl.h
>   create mode 100644 drivers/block/blk-snap/blk_deferred.c
>   create mode 100644 drivers/block/blk-snap/blk_deferred.h
>   create mode 100644 drivers/block/blk-snap/blk_descr_file.c
>   create mode 100644 drivers/block/blk-snap/blk_descr_file.h
>   create mode 100644 drivers/block/blk-snap/blk_descr_mem.c
>   create mode 100644 drivers/block/blk-snap/blk_descr_mem.h
>   create mode 100644 drivers/block/blk-snap/blk_descr_multidev.c
>   create mode 100644 drivers/block/blk-snap/blk_descr_multidev.h
>   create mode 100644 drivers/block/blk-snap/blk_descr_pool.c
>   create mode 100644 drivers/block/blk-snap/blk_descr_pool.h
>   create mode 100644 drivers/block/blk-snap/blk_redirect.c
>   create mode 100644 drivers/block/blk-snap/blk_redirect.h
>   create mode 100644 drivers/block/blk-snap/blk_util.c
>   create mode 100644 drivers/block/blk-snap/blk_util.h
>   create mode 100644 drivers/block/blk-snap/cbt_map.c
>   create mode 100644 drivers/block/blk-snap/cbt_map.h
>   create mode 100644 drivers/block/blk-snap/common.h
>   create mode 100644 drivers/block/blk-snap/ctrl_fops.c
>   create mode 100644 drivers/block/blk-snap/ctrl_fops.h
>   create mode 100644 drivers/block/blk-snap/ctrl_pipe.c
>   create mode 100644 drivers/block/blk-snap/ctrl_pipe.h
>   create mode 100644 drivers/block/blk-snap/ctrl_sysfs.c
>   create mode 100644 drivers/block/blk-snap/ctrl_sysfs.h
>   create mode 100644 drivers/block/blk-snap/defer_io.c
>   create mode 100644 drivers/block/blk-snap/defer_io.h
>   create mode 100644 drivers/block/blk-snap/main.c
>   create mode 100644 drivers/block/blk-snap/params.c
>   create mode 100644 drivers/block/blk-snap/params.h
>   create mode 100644 drivers/block/blk-snap/rangevector.c
>   create mode 100644 drivers/block/blk-snap/rangevector.h
>   create mode 100644 drivers/block/blk-snap/snapimage.c
>   create mode 100644 drivers/block/blk-snap/snapimage.h
>   create mode 100644 drivers/block/blk-snap/snapshot.c
>   create mode 100644 drivers/block/blk-snap/snapshot.h
>   create mode 100644 drivers/block/blk-snap/snapstore.c
>   create mode 100644 drivers/block/blk-snap/snapstore.h
>   create mode 100644 drivers/block/blk-snap/snapstore_device.c
>   create mode 100644 drivers/block/blk-snap/snapstore_device.h
>   create mode 100644 drivers/block/blk-snap/snapstore_file.c
>   create mode 100644 drivers/block/blk-snap/snapstore_file.h
>   create mode 100644 drivers/block/blk-snap/snapstore_mem.c
>   create mode 100644 drivers/block/blk-snap/snapstore_mem.h
>   create mode 100644 drivers/block/blk-snap/snapstore_multidev.c
>   create mode 100644 drivers/block/blk-snap/snapstore_multidev.h
>   create mode 100644 drivers/block/blk-snap/tracker.c
>   create mode 100644 drivers/block/blk-snap/tracker.h
>   create mode 100644 drivers/block/blk-snap/tracking.c
>   create mode 100644 drivers/block/blk-snap/tracking.h
>   create mode 100644 drivers/block/blk-snap/version.h
>   create mode 100644 include/linux/blk-filter.h
> 
> --
> 2.20.1
> 
I do understand where you are coming from, but then we already have a 
dm-snap which does exactly what you want to achieve.
Of course, that would require a reconfiguration of the storage stack on 
the machine, which is not always possible (or desired).

What I _could_ imagine would be a 'dm-intercept' thingie, which 
redirects the current submit_bio() function for any block device, and 
re-routes that to a linear device-mapper device pointing back to the 
original block device.

That way you could attach it to basically any block device, _and_ can 
use the existing device-mapper functionality to do fancy stuff once the 
submit_io() callback has been re-routed.

And it also would help in other scenarios, too; with such a 
functionality we could seamlessly clone devices without having to move 
the whole setup to device-mapper first.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-21 13:31 ` [PATCH 0/2] block layer filter and block device snapshot module Hannes Reinecke
@ 2020-10-21 14:10   ` Sergei Shtepa
  2020-10-22  5:58     ` Hannes Reinecke
  0 siblings, 1 reply; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 14:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

The 10/21/2020 16:31, Hannes Reinecke wrote:
> I do understand where you are coming from, but then we already have a 
> dm-snap which does exactly what you want to achieve.
> Of course, that would require a reconfiguration of the storage stack on 
> the machine, which is not always possible (or desired).

Yes, reconfiguring the storage stack on a machine is almost impossible.

> 
> What I _could_ imagine would be a 'dm-intercept' thingie, which 
> redirects the current submit_bio() function for any block device, and 
> re-routes that to a linear device-mapper device pointing back to the 
> original block device.
> 
> That way you could attach it to basically any block device, _and_ can 
> use the existing device-mapper functionality to do fancy stuff once the 
> submit_io() callback has been re-routed.
> 
> And it also would help in other scenarios, too; with such a 
> functionality we could seamlessly clone devices without having to move 
> the whole setup to device-mapper first.

Hm... 
Did I understand correctly that the filter itself can be left approximately
as it is, but the blk-snap module can be replaced with 'dm-intercept',
which would use the re-route mechanism from the dm?
I think I may be able to implement it, if you describe your idea in more
detail.


-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21 13:07         ` Matthew Wilcox
@ 2020-10-21 14:35           ` Sergei Shtepa
  0 siblings, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-21 14:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Damien Le Moal, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

The 10/21/2020 16:07, Matthew Wilcox wrote:
> On Wed, Oct 21, 2020 at 03:55:55PM +0300, Sergei Shtepa wrote:
> > The 10/21/2020 14:44, Matthew Wilcox wrote:
> > > I don't understand why O_DIRECT gets to bypass the block filter.  Nor do
> > > I understand why anybody would place a block filter on the swap device.
> > > But if somebody did place a filter on the swap device, why should swap
> > > be able to bypass the filter?
> > 
> > Yes, intercepting the swap partition is absurd. But we can't guarantee
> > that the filter won't intercept swap.
> > 
> > Swap operation is related to the memory allocation logic. If a swap on
> > the block device are accessed during memory allocation from filter,
> > a deadlock occurs. We can allow filters to occasionally shoot off their
> > feet, especially under high load. But I think it's better not to do it.
> 
> We already have logic to prevent this in Linux.  Filters need to
> call memalloc_noio_save() while they might cause swap to happen and
> memalloc_noio_restore() once it's safe for them to cause swap again.

Yes, I looked at this function, it can really be useful for the filter.
Then I don't need to enter the submit_bio_direct() function and the wait
loop associated with the queue polling function blk_mq_poll() will have
to be rewritten.

> 
> > "directly access" - it is not O_DIRECT. This means (I think) direct
> > reading from the device file, like "dd if=/dev/sda1".
> > As for intercepting direct reading, I don't know how to do the right thing.
> > 
> > The problem here is that in fs/block_dev.c in function __blkdev_direct_IO()
> > uses the qc - value returned by the submit_bio() function.
> > This value is used below when calling 
> > blk_poll(bdev_get_queue(dev), qc, true).
> > The filter cannot return a meaningful value of the blk_qc_t type when
> > intercepting a request, because at that time it does not know which queue
> > the request will fall into.
> > 
> > If function submit_bio() will always return BLK_QC_T_NONE - I think the
> > algorithm of the __blk dev_direct_IO() will not work correctly.
> > If we need to intercept direct access to a block device, we need to at
> > least redo the __blkdev_direct_IO function, getting rid of blk_pool.
> > I'm not sure it's necessary yet.
> 
> This isn't part of the block layer that I'm familiar with, so I can't
> help solve this problem, but allowing O_DIRECT to bypass the block filter
> is a hole that needs to be fixed before these patches can be considered.

I think there is no such problem, but I will check, of course.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
  2020-10-21  9:14   ` Johannes Thumshirn
  2020-10-21  9:21   ` Damien Le Moal
@ 2020-10-21 15:09   ` Randy Dunlap
  2020-10-24 14:53   ` Greg KH
  3 siblings, 0 replies; 29+ messages in thread
From: Randy Dunlap @ 2020-10-21 15:09 UTC (permalink / raw)
  To: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, johannes.thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i,
	damien.lemoal, steve, linux-block, linux-kernel, linux-pm,
	linux-mm

On 10/21/20 2:04 AM, Sergei Shtepa wrote:
> diff --git a/block/Kconfig b/block/Kconfig
> index bbad5e8bbffe..a308801b4376 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -204,6 +204,17 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
>  	  by falling back to the kernel crypto API when inline
>  	  encryption hardware is not present.
>  
> +config BLK_FILTER
> +	bool "Enable support for block layer filters"
> +	default y

Drop the default y. We don't add modules to a default build without
some tough justification.

> +	depends on MODULES
> +	help
> +	  Enabling this lets third-party kernel modules intercept

	                lets loadable kernel modules intercept

> +	  bio requests for any block device. This allows them to implement
> +	  changed block tracking and snapshots without any reconfiguration of
> +	  the existing setup. For example, this option allows snapshotting of
> +	  a block device without adding it to LVM.


-- 
~Randy


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices
       [not found] ` <1603271049-20681-3-git-send-email-sergei.shtepa@veeam.com>
  2020-10-21  9:08   ` [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices Pavel Machek
       [not found]   ` <BL0PR04MB65142D9F391FE8777F096EF5E71C0@BL0PR04MB6514.namprd04.prod.outlook.com>
@ 2020-10-21 15:11   ` Randy Dunlap
  2 siblings, 0 replies; 29+ messages in thread
From: Randy Dunlap @ 2020-10-21 15:11 UTC (permalink / raw)
  To: Sergei Shtepa, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, johannes.thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i,
	damien.lemoal, steve, linux-block, linux-kernel, linux-pm,
	linux-mm

On 10/21/20 2:04 AM, Sergei Shtepa wrote:
> diff --git a/drivers/block/blk-snap/Kconfig b/drivers/block/blk-snap/Kconfig
> new file mode 100644
> index 000000000000..7a2db99a80dd
> --- /dev/null
> +++ b/drivers/block/blk-snap/Kconfig
> @@ -0,0 +1,24 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# blk-snap block io layer filter module configuration
> +#
> +#
> +#select BLK_FILTER
> +
> +config BLK_SNAP
> +	tristate "Block device snapshot filter"
> +	depends on BLK_FILTER
> +	help
> +

No blank line here.

> +	  Allow to create snapshots and track block changes for a block

	                                                    for block

> +	  devices. Designed for creating backups for any block devices
> +	  (without device mapper). Snapshots are temporary and are released
> +	  then backup is completed. Change block tracking allows you to

	  when

> +	  create incremental or differential backups.
> +
> +config BLK_SNAP_SNAPSTORE_MULTIDEV
> +	bool "Multi device snapstore configuration support"
> +	depends on BLK_SNAP
> +	help
> +
No blank line here.

> +	  Allow to create snapstore on multiple block devices.


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-21 14:10   ` Sergei Shtepa
@ 2020-10-22  5:58     ` Hannes Reinecke
  2020-10-22  9:44       ` Sergei Shtepa
  0 siblings, 1 reply; 29+ messages in thread
From: Hannes Reinecke @ 2020-10-22  5:58 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On 10/21/20 4:10 PM, Sergei Shtepa wrote:
> The 10/21/2020 16:31, Hannes Reinecke wrote:
>> I do understand where you are coming from, but then we already have a
>> dm-snap which does exactly what you want to achieve.
>> Of course, that would require a reconfiguration of the storage stack on
>> the machine, which is not always possible (or desired).
> 
> Yes, reconfiguring the storage stack on a machine is almost impossible.
> 
>>
>> What I _could_ imagine would be a 'dm-intercept' thingie, which
>> redirects the current submit_bio() function for any block device, and
>> re-routes that to a linear device-mapper device pointing back to the
>> original block device.
>>
>> That way you could attach it to basically any block device, _and_ can
>> use the existing device-mapper functionality to do fancy stuff once the
>> submit_io() callback has been re-routed.
>>
>> And it also would help in other scenarios, too; with such a
>> functionality we could seamlessly clone devices without having to move
>> the whole setup to device-mapper first.
> 
> Hm...
> Did I understand correctly that the filter itself can be left approximately
> as it is, but the blk-snap module can be replaced with 'dm-intercept',
> which would use the re-route mechanism from the dm?
> I think I may be able to implement it, if you describe your idea in more
> detail.
> 
> 
Actually, once we have an dm-intercept, why do you need the block-layer 
filter at all?
 From you initial description the block-layer filter was implemented 
such that blk-snap could work; but if we have dm-intercept (and with it 
the ability to use device-mapper functionality even for normal block 
devices) there wouldn't be any need for the block-layer filter, no?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22  5:58     ` Hannes Reinecke
@ 2020-10-22  9:44       ` Sergei Shtepa
  2020-10-22 10:28         ` Damien Le Moal
  0 siblings, 1 reply; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-22  9:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

The 10/22/2020 08:58, Hannes Reinecke wrote:
> On 10/21/20 4:10 PM, Sergei Shtepa wrote:
> > The 10/21/2020 16:31, Hannes Reinecke wrote:
> >> I do understand where you are coming from, but then we already have a
> >> dm-snap which does exactly what you want to achieve.
> >> Of course, that would require a reconfiguration of the storage stack on
> >> the machine, which is not always possible (or desired).
> > 
> > Yes, reconfiguring the storage stack on a machine is almost impossible.
> > 
> >>
> >> What I _could_ imagine would be a 'dm-intercept' thingie, which
> >> redirects the current submit_bio() function for any block device, and
> >> re-routes that to a linear device-mapper device pointing back to the
> >> original block device.
> >>
> >> That way you could attach it to basically any block device, _and_ can
> >> use the existing device-mapper functionality to do fancy stuff once the
> >> submit_io() callback has been re-routed.
> >>
> >> And it also would help in other scenarios, too; with such a
> >> functionality we could seamlessly clone devices without having to move
> >> the whole setup to device-mapper first.
> > 
> > Hm...
> > Did I understand correctly that the filter itself can be left approximately
> > as it is, but the blk-snap module can be replaced with 'dm-intercept',
> > which would use the re-route mechanism from the dm?
> > I think I may be able to implement it, if you describe your idea in more
> > detail.
> > 
> > 
> Actually, once we have an dm-intercept, why do you need the block-layer 
> filter at all?
>  From you initial description the block-layer filter was implemented 
> such that blk-snap could work; but if we have dm-intercept (and with it 
> the ability to use device-mapper functionality even for normal block 
> devices) there wouldn't be any need for the block-layer filter, no?

Maybe, but the problem is that I can't imagine how to implement
dm-intercept yet. 
How to use dm to implement interception without changing the stack
of block devices. We'll have to make a hook somewhere, isn`t it?

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22  9:44       ` Sergei Shtepa
@ 2020-10-22 10:28         ` Damien Le Moal
  2020-10-22 13:52           ` Sergei Shtepa
  0 siblings, 1 reply; 29+ messages in thread
From: Damien Le Moal @ 2020-10-22 10:28 UTC (permalink / raw)
  To: Sergei Shtepa, Hannes Reinecke
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, Johannes Thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, steve, linux-block,
	linux-kernel, linux-pm, linux-mm

On 2020/10/22 18:43, Sergei Shtepa wrote:
> The 10/22/2020 08:58, Hannes Reinecke wrote:
>> On 10/21/20 4:10 PM, Sergei Shtepa wrote:
>>> The 10/21/2020 16:31, Hannes Reinecke wrote:
>>>> I do understand where you are coming from, but then we already have a
>>>> dm-snap which does exactly what you want to achieve.
>>>> Of course, that would require a reconfiguration of the storage stack on
>>>> the machine, which is not always possible (or desired).
>>>
>>> Yes, reconfiguring the storage stack on a machine is almost impossible.
>>>
>>>>
>>>> What I _could_ imagine would be a 'dm-intercept' thingie, which
>>>> redirects the current submit_bio() function for any block device, and
>>>> re-routes that to a linear device-mapper device pointing back to the
>>>> original block device.
>>>>
>>>> That way you could attach it to basically any block device, _and_ can
>>>> use the existing device-mapper functionality to do fancy stuff once the
>>>> submit_io() callback has been re-routed.
>>>>
>>>> And it also would help in other scenarios, too; with such a
>>>> functionality we could seamlessly clone devices without having to move
>>>> the whole setup to device-mapper first.
>>>
>>> Hm...
>>> Did I understand correctly that the filter itself can be left approximately
>>> as it is, but the blk-snap module can be replaced with 'dm-intercept',
>>> which would use the re-route mechanism from the dm?
>>> I think I may be able to implement it, if you describe your idea in more
>>> detail.
>>>
>>>
>> Actually, once we have an dm-intercept, why do you need the block-layer 
>> filter at all?
>>  From you initial description the block-layer filter was implemented 
>> such that blk-snap could work; but if we have dm-intercept (and with it 
>> the ability to use device-mapper functionality even for normal block 
>> devices) there wouldn't be any need for the block-layer filter, no?
> 
> Maybe, but the problem is that I can't imagine how to implement
> dm-intercept yet. 
> How to use dm to implement interception without changing the stack
> of block devices. We'll have to make a hook somewhere, isn`t it?

Once your dm-intercept target driver is inserted with "dmsetup" or any user land
tool you implement using libdevicemapper, the "hooks" will naturally be in place
since the dm infrastructure already does that: all submitted BIOs will be passed
to dm-intercept through the "map" operation defined in the target_type
descriptor. It is then that driver job to execute the BIOs as it sees fit.

Look at simple device mappers like dm-linear or dm-flakey for hints of how
things work (driver/md/dm-linear.c). More complex dm drivers like dm-crypt,
dm-writecache or dm-thin can give you hints about more features of device mapper.
Functions such as __map_bio() in drivers/md/dm.c are the core of DM and show
what happens to BIOs depending on the the return value of the map operation.
dm_submit_bio() and __split_and_process_bio() is the entry points for BIO
processing in DM.

> 
>>
>> Cheers,
>>
>> Hannes
>> -- 
>> Dr. Hannes Reinecke                Kernel Storage Architect
>> hare@suse.de                              +49 911 74053 688
>> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
>> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22 10:28         ` Damien Le Moal
@ 2020-10-22 13:52           ` Sergei Shtepa
  2020-10-22 15:14             ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-22 13:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, axboe, viro, hch, darrick.wong, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

The 10/22/2020 13:28, Damien Le Moal wrote:
> On 2020/10/22 18:43, Sergei Shtepa wrote:
> > 
> > Maybe, but the problem is that I can't imagine how to implement
> > dm-intercept yet. 
> > How to use dm to implement interception without changing the stack
> > of block devices. We'll have to make a hook somewhere, isn`t it?
> 
> Once your dm-intercept target driver is inserted with "dmsetup" or any user land
> tool you implement using libdevicemapper, the "hooks" will naturally be in place
> since the dm infrastructure already does that: all submitted BIOs will be passed
> to dm-intercept through the "map" operation defined in the target_type
> descriptor. It is then that driver job to execute the BIOs as it sees fit.
> 
> Look at simple device mappers like dm-linear or dm-flakey for hints of how
> things work (driver/md/dm-linear.c). More complex dm drivers like dm-crypt,
> dm-writecache or dm-thin can give you hints about more features of device mapper.
> Functions such as __map_bio() in drivers/md/dm.c are the core of DM and show
> what happens to BIOs depending on the the return value of the map operation.
> dm_submit_bio() and __split_and_process_bio() is the entry points for BIO
> processing in DM.
> 

Is there something I don't understand? Please correct me.

Let me remind that by the condition of the problem, we can't change
the configuration of the block device stack.

Let's imagine this configuration: /root mount point on ext filesystem
on /dev/sda1.
+---------------+
|               |
|  /root        |
|               |
+---------------+
|               |
| EXT FS        |
|               |
+---------------+
|               |
| block layer   |
|               |
| sda queue     |
|               |
+---------------+
|               |
| scsi driver   |
|               |
+---------------+

We need to add change block tracking (CBT) and snapshot functionality for
incremental backup.

With the DM we need to change the block device stack. Add device /dev/sda1
to LVM Volume group, create logical volume, change /etc/fstab and reboot.

The new scheme will look like this:
+---------------+
|               |
|  /root        |
|               |
+---------------+
|               |
| EXT FS        |
|               |
+---------------+
|               |
| LV-root       |
|               |
+------------------+
|                  |
| dm-cbt & dm-snap |
|                  |
+------------------+
|               |
| sda queue     |
|               |
+---------------+
|               |
| scsi driver   |
|               |
+---------------+

But I cannot change block device stack. And so I propose a scheme with
interception.
+---------------+
|               |
|  /root        |
|               |
+---------------+
|               |
| EXT FS        |
|               |
+---------------+   +-----------------+
|  |            |   |                 |
|  | blk-filter |-> | cbt & snapshot  |
|  |            |<- |                 |
|  +------------+   +-----------------+
|               |
| sda blk queue |
|               |
+---------------+
|               |
| scsi driver   |
|               |
+---------------+

Perhaps I can make "cbt & snapshot" inside the DM, but without interception
in any case, it will not work. Isn't that right?

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22 13:52           ` Sergei Shtepa
@ 2020-10-22 15:14             ` Darrick J. Wong
  2020-10-22 17:54               ` Mike Snitzer
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-10-22 15:14 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Damien Le Moal, Hannes Reinecke, axboe, viro, hch, linux-xfs,
	linux-fsdevel, rjw, len.brown, pavel, akpm, Johannes Thumshirn,
	ming.lei, jack, tj, gustavo, bvanassche, osandov, koct9i, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On Thu, Oct 22, 2020 at 04:52:13PM +0300, Sergei Shtepa wrote:
> The 10/22/2020 13:28, Damien Le Moal wrote:
> > On 2020/10/22 18:43, Sergei Shtepa wrote:
> > > 
> > > Maybe, but the problem is that I can't imagine how to implement
> > > dm-intercept yet. 
> > > How to use dm to implement interception without changing the stack
> > > of block devices. We'll have to make a hook somewhere, isn`t it?
> > 
> > Once your dm-intercept target driver is inserted with "dmsetup" or any user land
> > tool you implement using libdevicemapper, the "hooks" will naturally be in place
> > since the dm infrastructure already does that: all submitted BIOs will be passed
> > to dm-intercept through the "map" operation defined in the target_type
> > descriptor. It is then that driver job to execute the BIOs as it sees fit.
> > 
> > Look at simple device mappers like dm-linear or dm-flakey for hints of how
> > things work (driver/md/dm-linear.c). More complex dm drivers like dm-crypt,
> > dm-writecache or dm-thin can give you hints about more features of device mapper.
> > Functions such as __map_bio() in drivers/md/dm.c are the core of DM and show
> > what happens to BIOs depending on the the return value of the map operation.
> > dm_submit_bio() and __split_and_process_bio() is the entry points for BIO
> > processing in DM.
> > 
> 
> Is there something I don't understand? Please correct me.
> 
> Let me remind that by the condition of the problem, we can't change
> the configuration of the block device stack.
> 
> Let's imagine this configuration: /root mount point on ext filesystem
> on /dev/sda1.
> +---------------+
> |               |
> |  /root        |
> |               |
> +---------------+
> |               |
> | EXT FS        |
> |               |
> +---------------+
> |               |
> | block layer   |
> |               |
> | sda queue     |
> |               |
> +---------------+
> |               |
> | scsi driver   |
> |               |
> +---------------+
> 
> We need to add change block tracking (CBT) and snapshot functionality for
> incremental backup.
> 
> With the DM we need to change the block device stack. Add device /dev/sda1
> to LVM Volume group, create logical volume, change /etc/fstab and reboot.
> 
> The new scheme will look like this:
> +---------------+
> |               |
> |  /root        |
> |               |
> +---------------+
> |               |
> | EXT FS        |
> |               |
> +---------------+
> |               |
> | LV-root       |
> |               |
> +------------------+
> |                  |
> | dm-cbt & dm-snap |
> |                  |
> +------------------+
> |               |
> | sda queue     |
> |               |
> +---------------+
> |               |
> | scsi driver   |
> |               |
> +---------------+
> 
> But I cannot change block device stack. And so I propose a scheme with
> interception.
> +---------------+
> |               |
> |  /root        |
> |               |
> +---------------+
> |               |
> | EXT FS        |
> |               |
> +---------------+   +-----------------+
> |  |            |   |                 |
> |  | blk-filter |-> | cbt & snapshot  |
> |  |            |<- |                 |
> |  +------------+   +-----------------+
> |               |
> | sda blk queue |
> |               |
> +---------------+
> |               |
> | scsi driver   |
> |               |
> +---------------+
> 
> Perhaps I can make "cbt & snapshot" inside the DM, but without interception
> in any case, it will not work. Isn't that right?

Stupid question: Why don't you change the block layer to make it
possible to insert device mapper devices after the blockdev has been set
up?

--D

> 
> -- 
> Sergei Shtepa
> Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22 15:14             ` Darrick J. Wong
@ 2020-10-22 17:54               ` Mike Snitzer
  2020-10-23  9:13                 ` hch
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Snitzer @ 2020-10-22 17:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Sergei Shtepa, Damien Le Moal, Hannes Reinecke, axboe, viro, hch,
	linux-xfs, linux-fsdevel, rjw, len.brown, pavel, akpm,
	Johannes Thumshirn, ming.lei, jack, tj, gustavo, bvanassche,
	osandov, koct9i, steve, linux-block, linux-kernel, linux-pm,
	linux-mm, device-mapper development, Alasdair G Kergon

On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Thu, Oct 22, 2020 at 04:52:13PM +0300, Sergei Shtepa wrote:
> > The 10/22/2020 13:28, Damien Le Moal wrote:
> > > On 2020/10/22 18:43, Sergei Shtepa wrote:
> > > >
> > > > Maybe, but the problem is that I can't imagine how to implement
> > > > dm-intercept yet.
> > > > How to use dm to implement interception without changing the stack
> > > > of block devices. We'll have to make a hook somewhere, isn`t it?
> > >
> > > Once your dm-intercept target driver is inserted with "dmsetup" or any user land
> > > tool you implement using libdevicemapper, the "hooks" will naturally be in place
> > > since the dm infrastructure already does that: all submitted BIOs will be passed
> > > to dm-intercept through the "map" operation defined in the target_type
> > > descriptor. It is then that driver job to execute the BIOs as it sees fit.
> > >
> > > Look at simple device mappers like dm-linear or dm-flakey for hints of how
> > > things work (driver/md/dm-linear.c). More complex dm drivers like dm-crypt,
> > > dm-writecache or dm-thin can give you hints about more features of device mapper.
> > > Functions such as __map_bio() in drivers/md/dm.c are the core of DM and show
> > > what happens to BIOs depending on the the return value of the map operation.
> > > dm_submit_bio() and __split_and_process_bio() is the entry points for BIO
> > > processing in DM.
> > >
> >
> > Is there something I don't understand? Please correct me.
> >
> > Let me remind that by the condition of the problem, we can't change
> > the configuration of the block device stack.
> >
> > Let's imagine this configuration: /root mount point on ext filesystem
> > on /dev/sda1.
> > +---------------+
> > |               |
> > |  /root        |
> > |               |
> > +---------------+
> > |               |
> > | EXT FS        |
> > |               |
> > +---------------+
> > |               |
> > | block layer   |
> > |               |
> > | sda queue     |
> > |               |
> > +---------------+
> > |               |
> > | scsi driver   |
> > |               |
> > +---------------+
> >
> > We need to add change block tracking (CBT) and snapshot functionality for
> > incremental backup.
> >
> > With the DM we need to change the block device stack. Add device /dev/sda1
> > to LVM Volume group, create logical volume, change /etc/fstab and reboot.
> >
> > The new scheme will look like this:
> > +---------------+
> > |               |
> > |  /root        |
> > |               |
> > +---------------+
> > |               |
> > | EXT FS        |
> > |               |
> > +---------------+
> > |               |
> > | LV-root       |
> > |               |
> > +------------------+
> > |                  |
> > | dm-cbt & dm-snap |
> > |                  |
> > +------------------+
> > |               |
> > | sda queue     |
> > |               |
> > +---------------+
> > |               |
> > | scsi driver   |
> > |               |
> > +---------------+
> >
> > But I cannot change block device stack. And so I propose a scheme with
> > interception.
> > +---------------+
> > |               |
> > |  /root        |
> > |               |
> > +---------------+
> > |               |
> > | EXT FS        |
> > |               |
> > +---------------+   +-----------------+
> > |  |            |   |                 |
> > |  | blk-filter |-> | cbt & snapshot  |
> > |  |            |<- |                 |
> > |  +------------+   +-----------------+
> > |               |
> > | sda blk queue |
> > |               |
> > +---------------+
> > |               |
> > | scsi driver   |
> > |               |
> > +---------------+
> >
> > Perhaps I can make "cbt & snapshot" inside the DM, but without interception
> > in any case, it will not work. Isn't that right?
>
> Stupid question: Why don't you change the block layer to make it
> possible to insert device mapper devices after the blockdev has been set
> up?

Not a stupid question.  Definitely something that us DM developers
have wanted to do for a while.  Devil is in the details but it is the
right way forward.

Otherwise, this intercept is really just a DM-lite remapping layer
without any of DM's well established capabilities.  Encouragingly, all
of the replies have effectively echoed this point.  (amusingly, seems
every mailing list under the sun is on the cc except dm-devel... now
rectified)

Alasdair has some concrete ideas on this line of work; I'm trying to
encourage him to reply ;)

Mike

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-21  9:04 [PATCH 0/2] block layer filter and block device snapshot module Sergei Shtepa
                   ` (2 preceding siblings ...)
  2020-10-21 13:31 ` [PATCH 0/2] block layer filter and block device snapshot module Hannes Reinecke
@ 2020-10-22 18:35 ` Mike Snitzer
  3 siblings, 0 replies; 29+ messages in thread
From: Mike Snitzer @ 2020-10-22 18:35 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: Jens Axboe, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-xfs, linux-fsdevel, rjw, len.brown, Pavel Machek,
	Andrew Morton, johannes.thumshirn, Ming Lei, Jan Kara, Tejun Heo,
	gustavo, Bart Van Assche, osandov, koct9i, Damien Le Moal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm,
	device-mapper development

On Wed, Oct 21, 2020 at 5:04 AM Sergei Shtepa <sergei.shtepa@veeam.com> wrote:
>
> Hello everyone! Requesting for your comments and suggestions.
>
> # blk-filter
>
> Block layer filter allows to intercept BIO requests to a block device.
>
> Interception is performed at the very beginning of the BIO request
> processing, and therefore does not affect the operation of the request
> processing queue. This also makes it possible to intercept requests from
> a specific block device, rather than from the entire disk.
>
> The logic of the submit_bio function has been changed - since the
> function execution results are not processed anywhere (except for swap
> and direct-io) the function won't return a value anymore.

Your desire to switch to a void return comes exactly when I've noticed
we need it.

->submit_bio's blk_qc_t return is the cookie assigned by blk-mq.  Up
to this point we haven't actually used it for bio-based devices but it
seems clear we'll soon need for bio-based IO polling support.

Just today, I've been auditing drivers/md/dm.c with an eye toward
properly handling the blk_qc_t return (or lack thereof) from various
DM methods.

It could easily be that __submit_bio_noacct and __submit_bio_noacct_mq
will be updated to do something meaningful with the returned cookie
(or that DM will) to facilitate proper IO polling.

Mike

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-22 17:54               ` Mike Snitzer
@ 2020-10-23  9:13                 ` hch
  2020-10-23 10:31                   ` Hannes Reinecke
  0 siblings, 1 reply; 29+ messages in thread
From: hch @ 2020-10-23  9:13 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Darrick J. Wong, Sergei Shtepa, Damien Le Moal, Hannes Reinecke,
	axboe, viro, hch, linux-xfs, linux-fsdevel, rjw, len.brown,
	pavel, akpm, Johannes Thumshirn, ming.lei, jack, tj, gustavo,
	bvanassche, osandov, koct9i, steve, linux-block, linux-kernel,
	linux-pm, linux-mm, device-mapper development, Alasdair G Kergon

On Thu, Oct 22, 2020 at 01:54:16PM -0400, Mike Snitzer wrote:
> On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong
> > Stupid question: Why don't you change the block layer to make it
> > possible to insert device mapper devices after the blockdev has been set
> > up?
> 
> Not a stupid question.  Definitely something that us DM developers
> have wanted to do for a while.  Devil is in the details but it is the
> right way forward.
> 

Yes, I think that is the right thing to do.  And I don't think it should
be all that hard.  All we'd need in the I/O path is something like the
pseudo-patch below, which will allow the interposer driver to resubmit
bios using submit_bio_noacct as long as the driver sets BIO_INTERPOSED.

diff --git a/block/blk-core.c b/block/blk-core.c
index ac00d2fa4eb48d..3f6f1eb565e0a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1051,6 +1051,9 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 		return BLK_QC_T_NONE;
 	}
 
+	if (blk_has_interposer(bio->bi_disk) &&
+	    !(bio->bi_flags & BIO_INTERPOSED))
+		return __submit_bio_interposed(bio);
 	if (!bio->bi_disk->fops->submit_bio)
 		return __submit_bio_noacct_mq(bio);
 	return __submit_bio_noacct(bio);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-23  9:13                 ` hch
@ 2020-10-23 10:31                   ` Hannes Reinecke
  2020-10-23 11:04                     ` Sergei Shtepa
  2020-10-23 11:12                     ` [dm-devel] " hch
  0 siblings, 2 replies; 29+ messages in thread
From: Hannes Reinecke @ 2020-10-23 10:31 UTC (permalink / raw)
  To: hch, Mike Snitzer
  Cc: Darrick J. Wong, Sergei Shtepa, Damien Le Moal, axboe, viro,
	linux-xfs, linux-fsdevel, rjw, len.brown, pavel, akpm,
	Johannes Thumshirn, ming.lei, jack, tj, gustavo, bvanassche,
	osandov, koct9i, steve, linux-block, linux-kernel, linux-pm,
	linux-mm, device-mapper development, Alasdair G Kergon

On 10/23/20 11:13 AM, hch@infradead.org wrote:
> On Thu, Oct 22, 2020 at 01:54:16PM -0400, Mike Snitzer wrote:
>> On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong
>>> Stupid question: Why don't you change the block layer to make it
>>> possible to insert device mapper devices after the blockdev has been set
>>> up?
>>
>> Not a stupid question.  Definitely something that us DM developers
>> have wanted to do for a while.  Devil is in the details but it is the
>> right way forward.
>>
> 
> Yes, I think that is the right thing to do.  And I don't think it should
> be all that hard.  All we'd need in the I/O path is something like the
> pseudo-patch below, which will allow the interposer driver to resubmit
> bios using submit_bio_noacct as long as the driver sets BIO_INTERPOSED.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ac00d2fa4eb48d..3f6f1eb565e0a8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1051,6 +1051,9 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
>   		return BLK_QC_T_NONE;
>   	}
>   
> +	if (blk_has_interposer(bio->bi_disk) &&
> +	    !(bio->bi_flags & BIO_INTERPOSED))
> +		return __submit_bio_interposed(bio);
>   	if (!bio->bi_disk->fops->submit_bio)
>   		return __submit_bio_noacct_mq(bio);
>   	return __submit_bio_noacct(bio);
> 
My thoughts went more into the direction of hooking into ->submit_bio, 
seeing that it's a NULL pointer for most (all?) block drivers.

But sure, I'll check how the interposer approach would turn out.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-23 10:31                   ` Hannes Reinecke
@ 2020-10-23 11:04                     ` Sergei Shtepa
  2020-10-23 11:12                     ` [dm-devel] " hch
  1 sibling, 0 replies; 29+ messages in thread
From: Sergei Shtepa @ 2020-10-23 11:04 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: hch, Mike Snitzer, Darrick J. Wong, Damien Le Moal, axboe, viro,
	linux-xfs, linux-fsdevel, rjw, len.brown, pavel, akpm,
	Johannes Thumshirn, ming.lei, jack, tj, gustavo, bvanassche,
	osandov, koct9i, steve, linux-block, linux-kernel, linux-pm,
	linux-mm, device-mapper development, Alasdair G Kergon

The 10/23/2020 13:31, Hannes Reinecke wrote:
> On 10/23/20 11:13 AM, hch@infradead.org wrote:
> > On Thu, Oct 22, 2020 at 01:54:16PM -0400, Mike Snitzer wrote:
> >> On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong
> >>> Stupid question: Why don't you change the block layer to make it
> >>> possible to insert device mapper devices after the blockdev has been set
> >>> up?
> >>
> >> Not a stupid question.  Definitely something that us DM developers
> >> have wanted to do for a while.  Devil is in the details but it is the
> >> right way forward.
> >>
> > 
> > Yes, I think that is the right thing to do.  And I don't think it should
> > be all that hard.  All we'd need in the I/O path is something like the
> > pseudo-patch below, which will allow the interposer driver to resubmit
> > bios using submit_bio_noacct as long as the driver sets BIO_INTERPOSED.
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index ac00d2fa4eb48d..3f6f1eb565e0a8 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1051,6 +1051,9 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
> >   		return BLK_QC_T_NONE;
> >   	}
> >   
> > +	if (blk_has_interposer(bio->bi_disk) &&
> > +	    !(bio->bi_flags & BIO_INTERPOSED))
> > +		return __submit_bio_interposed(bio);
> >   	if (!bio->bi_disk->fops->submit_bio)
> >   		return __submit_bio_noacct_mq(bio);
> >   	return __submit_bio_noacct(bio);
> > 

It`s will be great! Approximately this interception capability is not
enough now.

> My thoughts went more into the direction of hooking into ->submit_bio, 
> seeing that it's a NULL pointer for most (all?) block drivers.
> 
> But sure, I'll check how the interposer approach would turn out.

If anyone will do the patch blk-interposer, please add me to CC.
I will try to offer my module that will use blk-interposer.

-- 
Sergei Shtepa
Veeam Software developer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [dm-devel] [PATCH 0/2] block layer filter and block device snapshot module
  2020-10-23 10:31                   ` Hannes Reinecke
  2020-10-23 11:04                     ` Sergei Shtepa
@ 2020-10-23 11:12                     ` hch
  1 sibling, 0 replies; 29+ messages in thread
From: hch @ 2020-10-23 11:12 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: hch, Mike Snitzer, jack, gustavo, linux-mm,
	device-mapper development, pavel, steve, osandov,
	Alasdair G Kergon, bvanassche, Darrick J. Wong, len.brown,
	linux-pm, ming.lei, linux-block, linux-fsdevel, viro,
	Sergei Shtepa, koct9i, axboe, Damien Le Moal, Johannes Thumshirn,
	rjw, linux-kernel, linux-xfs, tj, akpm

On Fri, Oct 23, 2020 at 12:31:05PM +0200, Hannes Reinecke wrote:
> My thoughts went more into the direction of hooking into ->submit_bio,
> seeing that it's a NULL pointer for most (all?) block drivers.
> 
> But sure, I'll check how the interposer approach would turn out.

submit_bio is owned by the underlying device, and for a good reason
stored in a const struct..

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/2] Block layer filter - second version
  2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
                     ` (2 preceding siblings ...)
  2020-10-21 15:09   ` Randy Dunlap
@ 2020-10-24 14:53   ` Greg KH
  3 siblings, 0 replies; 29+ messages in thread
From: Greg KH @ 2020-10-24 14:53 UTC (permalink / raw)
  To: Sergei Shtepa
  Cc: axboe, viro, hch, darrick.wong, linux-xfs, linux-fsdevel, rjw,
	len.brown, pavel, akpm, johannes.thumshirn, ming.lei, jack, tj,
	gustavo, bvanassche, osandov, koct9i, damien.lemoal, steve,
	linux-block, linux-kernel, linux-pm, linux-mm

On Wed, Oct 21, 2020 at 12:04:08PM +0300, Sergei Shtepa wrote:
> Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>

I know I don't take patches without any changelog text.

Maybe some maintainers are more lax...

Also, "second version" doesn't belong in the subject line, the
documentation shows how to properly version patch series, please do
that.

thanks,

greg k-h

> ---
>  block/Kconfig               |  11 ++
>  block/Makefile              |   1 +
>  block/blk-core.c            |  52 +++++--
>  block/blk-filter-internal.h |  29 ++++
>  block/blk-filter.c          | 286 ++++++++++++++++++++++++++++++++++++
>  block/partitions/core.c     |  14 +-
>  fs/block_dev.c              |   6 +-
>  fs/direct-io.c              |   2 +-
>  fs/iomap/direct-io.c        |   2 +-
>  include/linux/bio.h         |   4 +-
>  include/linux/blk-filter.h  |  76 ++++++++++
>  include/linux/genhd.h       |   8 +-
>  kernel/power/swap.c         |   2 +-
>  mm/page_io.c                |   4 +-
>  14 files changed, 471 insertions(+), 26 deletions(-)
>  create mode 100644 block/blk-filter-internal.h
>  create mode 100644 block/blk-filter.c
>  create mode 100644 include/linux/blk-filter.h
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index bbad5e8bbffe..a308801b4376 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -204,6 +204,17 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
>  	  by falling back to the kernel crypto API when inline
>  	  encryption hardware is not present.
>  
> +config BLK_FILTER
> +	bool "Enable support for block layer filters"
> +	default y
> +	depends on MODULES
> +	help
> +	  Enabling this lets third-party kernel modules intercept
> +	  bio requests for any block device. This allows them to implement
> +	  changed block tracking and snapshots without any reconfiguration of
> +	  the existing setup. For example, this option allows snapshotting of
> +	  a block device without adding it to LVM.
> +
>  menu "Partition Types"
>  
>  source "block/partitions/Kconfig"
> diff --git a/block/Makefile b/block/Makefile
> index 8d841f5f986f..b8ee50b8e031 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -38,3 +38,4 @@ obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
>  obj-$(CONFIG_BLK_PM)		+= blk-pm.o
>  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= keyslot-manager.o blk-crypto.o
>  obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK)	+= blk-crypto-fallback.o
> +obj-$(CONFIG_BLK_FILTER)	+= blk-filter.o
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 10c08ac50697..cc06402af695 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1216,23 +1216,20 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
>  EXPORT_SYMBOL(submit_bio_noacct);
>  
>  /**
> - * submit_bio - submit a bio to the block device layer for I/O
> - * @bio: The &struct bio which describes the I/O
> - *
> - * submit_bio() is used to submit I/O requests to block devices.  It is passed a
> - * fully set up &struct bio that describes the I/O that needs to be done.  The
> - * bio will be send to the device described by the bi_disk and bi_partno fields.
> + * submit_bio_direct - submit a bio to the block device layer for I/O
> + * bypass filter.
> + * @bio:  The bio describing the location in memory and on the device.
>   *
> - * The success/failure status of the request, along with notification of
> - * completion, is delivered asynchronously through the ->bi_end_io() callback
> - * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
> - * been called.
> + * Description:
> + *    This is a version of submit_bio() that shall only be used for I/O
> + *    that cannot be intercepted by block layer filters.
> + *    All file systems and other upper level users of the block layer
> + *    should use submit_bio() instead.
> + *    Use this function to access the swap partition and directly access
> + *    the block device file.
>   */
> -blk_qc_t submit_bio(struct bio *bio)
> +blk_qc_t submit_bio_direct(struct bio *bio)
>  {
> -	if (blkcg_punt_bio_submit(bio))
> -		return BLK_QC_T_NONE;
> -
>  	/*
>  	 * If it's a regular read/write or a barrier with data attached,
>  	 * go through the normal accounting stuff before submission.
> @@ -1282,8 +1279,35 @@ blk_qc_t submit_bio(struct bio *bio)
>  
>  	return submit_bio_noacct(bio);
>  }
> +EXPORT_SYMBOL(submit_bio_direct);
> +
> +/**
> + * submit_bio - submit a bio to the block device layer for I/O
> + * @bio: The &struct bio which describes the I/O
> + *
> + * submit_bio() is used to submit I/O requests to block devices.  It is passed a
> + * fully set up &struct bio that describes the I/O that needs to be done.  The
> + * bio will be send to the device described by the bi_disk and bi_partno fields.
> + *
> + * The success/failure status of the request, along with notification of
> + * completion, is delivered asynchronously through the ->bi_end_io() callback
> + * in @bio.  The bio must NOT be touched by thecaller until ->bi_end_io() has
> + * been called.
> + */
> +void submit_bio(struct bio *bio)
> +{
> +	if (blkcg_punt_bio_submit(bio))
> +		return;
> +
> +#ifdef CONFIG_BLK_FILTER
> +	blk_filter_submit_bio(bio);
> +#else
> +	submit_bio_direct(bio);
> +#endif
> +}
>  EXPORT_SYMBOL(submit_bio);
>  
> +
>  /**
>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>   *                              for the new queue limits
> diff --git a/block/blk-filter-internal.h b/block/blk-filter-internal.h
> new file mode 100644
> index 000000000000..d456a09f50db
> --- /dev/null
> +++ b/block/blk-filter-internal.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + *
> + * Block device filters internal declarations
> + */
> +
> +#ifndef BLK_FILTER_INTERNAL_H
> +#define BLK_FILTER_INTERNAL_H
> +
> +#ifdef CONFIG_BLK_FILTER
> +#include <linux/blk-filter.h>
> +
> +void blk_filter_part_add(struct hd_struct *part, dev_t devt);
> +
> +void blk_filter_part_del(struct hd_struct *part);
> +
> +#else /* CONFIG_BLK_FILTER */
> +
> +
> +static inline void blk_filter_part_add(struct hd_struct *part, dev_t devt)
> +{ };
> +
> +static inline void blk_filter_part_del(struct hd_struct *part)
> +{ };
> +
> +#endif /* CONFIG_BLK_FILTER */
> +
> +#endif
> diff --git a/block/blk-filter.c b/block/blk-filter.c
> new file mode 100644
> index 000000000000..f6de16c45a16
> --- /dev/null
> +++ b/block/blk-filter.c
> @@ -0,0 +1,286 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/genhd.h>
> +#include <linux/bio.h>
> +#include <linux/blkdev.h>
> +#include "blk-filter-internal.h"
> +#include <linux/rwsem.h>
> +
> +
> +LIST_HEAD(filters);
> +DECLARE_RWSEM(filters_lock);
> +
> +static void blk_filter_release(struct kref *kref)
> +{
> +	struct blk_filter *flt = container_of(kref, struct blk_filter, kref);
> +
> +	kfree(flt);
> +}
> +
> +static inline void blk_filter_get(struct blk_filter *flt)
> +{
> +	kref_get(&flt->kref);
> +}
> +
> +static inline void blk_filter_put(struct blk_filter *flt)
> +{
> +	kref_put(&flt->kref, blk_filter_release);
> +}
> +
> +
> +/**
> + * blk_filter_part_add() - Notify filters when a new partition is added.
> + * @part: The partition for new block device.
> + * @devt: Device id for new block device.
> + *
> + * Description:
> + *    When the block device is appears in the system, call the filter
> + *    callback to notify that the block device appears.
> + */
> +void blk_filter_part_add(struct hd_struct *part, dev_t devt)
> +{
> +	down_read(&filters_lock);
> +	if (!list_empty(&filters)) {
> +		struct list_head *_list_head;
> +
> +		list_for_each(_list_head, &filters) {
> +			void *filter_data;
> +			bool attached = false;
> +			struct blk_filter *flt;
> +
> +			flt = list_entry(_list_head, struct blk_filter, link);
> +
> +			attached = flt->ops->part_add(devt, &filter_data);
> +			if (attached) {
> +				blk_filter_get(flt);
> +				part->filter = flt;
> +				part->filter_data = filter_data;
> +				break;
> +			}
> +		}
> +	}
> +	up_read(&filters_lock);
> +}
> +
> +/**
> + * blk_filter_part_del() - Notify filters when the partition is deleted.
> + * @part: The partition of block device.
> + *
> + * Description:
> + *    When the block device is destroying and the partition is releasing,
> + *    call the filter callback to notify that the block device will be
> + *    deleted.
> + */
> +void blk_filter_part_del(struct hd_struct *part)
> +{
> +	struct blk_filter *flt = part->filter;
> +
> +	if (!flt)
> +		return;
> +
> +	flt->ops->part_del(part->filter_data);
> +
> +	part->filter_data = NULL;
> +	part->filter = NULL;
> +	blk_filter_put(flt);
> +}
> +
> +
> +/**
> + * blk_filter_submit_bio() - Send new bio to filters for processing.
> + * @bio: The new bio for block I/O layer.
> + *
> + * Description:
> + *    This function is an implementation of block layer filter
> + *    interception. If the filter is attached to this block device,
> + *    then bio will be redirected to the filter kernel module.
> + */
> +void blk_filter_submit_bio(struct bio *bio)
> +{
> +	bool intercepted = false;
> +	struct hd_struct *part;
> +
> +	bio_get(bio);
> +
> +	part = disk_get_part(bio->bi_disk, bio->bi_partno);
> +	if (unlikely(!part)) {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +
> +		bio_put(bio);
> +		return;
> +	}
> +
> +	down_read(&part->filter_rw_lockup);
> +
> +	if (part->filter)
> +		intercepted = part->filter->ops->filter_bio(bio, part->filter_data);
> +
> +	up_read(&part->filter_rw_lockup);
> +
> +	if (!intercepted)
> +		submit_bio_direct(bio);
> +
> +	disk_put_part(part);
> +
> +	bio_put(bio);
> +}
> +EXPORT_SYMBOL(blk_filter_submit_bio);
> +
> +/**
> + * blk_filter_register() - Register block layer filter.
> + * @ops: New filter callbacks.
> + *
> + * Return:
> + *     Filter ID, a pointer to the service structure of the filter.
> + *
> + * Description:
> + *    Create new filter structure.
> + *    Use blk_filter_attach to attach devices to filter.
> + */
> +void *blk_filter_register(struct blk_filter_ops *ops)
> +{
> +	struct blk_filter *flt;
> +
> +	flt = kzalloc(sizeof(struct blk_filter), GFP_KERNEL);
> +	if (!flt)
> +		return NULL;
> +
> +	kref_init(&flt->kref);
> +	flt->ops = ops;
> +
> +	down_write(&filters_lock);
> +	list_add_tail(&flt->link, &filters);
> +	up_write(&filters_lock);
> +
> +	return flt;
> +}
> +EXPORT_SYMBOL(blk_filter_register);
> +
> +/**
> + * blk_filter_unregister() - Unregister block layer filter.
> + * @filter: filter identifier.
> + *
> + * Description:
> + *    Before call blk_filter_unregister() and unload filter module all
> + *    partitions MUST be detached. Otherwise, the system will have a
> + *    filter with non-existent interception functions.
> + */
> +void blk_filter_unregister(void *filter)
> +{
> +	struct blk_filter *flt = filter;
> +
> +	down_write(&filters_lock);
> +	list_del(&flt->link);
> +	up_write(&filters_lock);
> +
> +	blk_filter_put(flt);
> +}
> +EXPORT_SYMBOL(blk_filter_unregister);
> +
> +/**
> + * blk_filter_attach() - Attach block layer filter.
> + * @devt: The block device identification number.
> + * @filter: Filter identifier.
> + * @filter_data: Specific filters data for this device.
> + *
> + * Return:
> + *    Return code.
> + *    -ENODEV - cannot find this device, it is OK if the device does not exist yet.
> + *    -EALREADY - this device is already attached to this filter.
> + *    -EBUSY - this device is already attached to the another filter.
> + *
> + * Description:
> + *    Attach the device to the block layer filter.
> + *    Only one filter can be attached to a single device.
> + */
> +int blk_filter_attach(dev_t devt, void *filter, void *filter_data)
> +{
> +	int ret = 0;
> +	struct blk_filter *flt = filter;
> +	struct block_device *blk_dev;
> +
> +
> +	blk_dev = bdget(devt);
> +	if (!blk_dev)
> +		return -ENODEV;
> +
> +	blk_filter_freeze(blk_dev);
> +
> +	if (blk_dev->bd_part->filter) {
> +		if (blk_dev->bd_part->filter == flt)
> +			ret = -EALREADY;
> +		else
> +			ret = -EBUSY;
> +	} else {
> +		blk_filter_get(flt);
> +		blk_dev->bd_part->filter = flt;
> +		blk_dev->bd_part->filter_data = filter_data;
> +	}
> +
> +	blk_filter_thaw(blk_dev);
> +
> +	bdput(blk_dev);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(blk_filter_attach);
> +
> +/**
> + * blk_filter_detach() - Detach block layer filter.
> + * @devt: The block device identification number.
> + *
> + * Description:
> + *    Detach the device from the block layer filter.
> + *    Do not forget detach all devices before calling the
> + *    blk_filter_unregister() function and unload the module!
> + */
> +void blk_filter_detach(dev_t devt)
> +{
> +	struct blk_filter *flt;
> +	struct block_device *blk_dev;
> +
> +	blk_dev = bdget(devt);
> +	if (!blk_dev)
> +		return;
> +
> +	blk_filter_freeze(blk_dev);
> +
> +	flt = blk_dev->bd_part->filter;
> +	if (flt) {
> +		blk_dev->bd_part->filter_data = NULL;
> +		blk_dev->bd_part->filter = NULL;
> +		blk_filter_put(flt);
> +	}
> +
> +	blk_filter_thaw(blk_dev);
> +
> +	bdput(blk_dev);
> +}
> +EXPORT_SYMBOL(blk_filter_detach);
> +
> +/**
> + * blk_filter_freeze() - Lock bio submitting.
> + * @bdev: The block device pointer.
> + *
> + * Description:
> + *    Stop bio processing.
> + */
> +void blk_filter_freeze(struct block_device *bdev)
> +{
> +	down_write(&bdev->bd_part->filter_rw_lockup);
> +}
> +EXPORT_SYMBOL(blk_filter_freeze);
> +
> +/**
> + * blk_filter_thaw() - Unlock bio submitting.
> + * @bdev: The block device pointer.
> + *
> + * Description:
> + *    Resume bio processing.
> + */
> +void blk_filter_thaw(struct block_device *bdev)
> +{
> +	up_write(&bdev->bd_part->filter_rw_lockup);
> +}
> +EXPORT_SYMBOL(blk_filter_thaw);
> diff --git a/block/partitions/core.c b/block/partitions/core.c
> index 722406b841df..6b845e98b9a1 100644
> --- a/block/partitions/core.c
> +++ b/block/partitions/core.c
> @@ -11,6 +11,7 @@
>  #include <linux/blktrace_api.h>
>  #include <linux/raid/detect.h>
>  #include "check.h"
> +#include "../blk-filter-internal.h"
>  
>  static int (*check_part[])(struct parsed_partitions *) = {
>  	/*
> @@ -320,9 +321,11 @@ int hd_ref_init(struct hd_struct *part)
>   */
>  void delete_partition(struct gendisk *disk, struct hd_struct *part)
>  {
> -	struct disk_part_tbl *ptbl =
> -		rcu_dereference_protected(disk->part_tbl, 1);
> +	struct disk_part_tbl *ptbl;
> +
> +	blk_filter_part_del(part);
>  
> +	ptbl = rcu_dereference_protected(disk->part_tbl, 1);
>  	/*
>  	 * ->part_tbl is referenced in this part's release handler, so
>  	 *  we have to hold the disk device
> @@ -412,6 +415,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	p->nr_sects = len;
>  	p->partno = partno;
>  	p->policy = get_disk_ro(disk);
> +#ifdef CONFIG_BLK_FILTER
> +	init_rwsem(&p->filter_rw_lockup);
> +#endif
>  
>  	if (info) {
>  		struct partition_meta_info *pinfo;
> @@ -469,6 +475,9 @@ static struct hd_struct *add_partition(struct gendisk *disk, int partno,
>  	/* everything is up and running, commence */
>  	rcu_assign_pointer(ptbl->part[partno], p);
>  
> +	/*inform filter about a new partition*/
> +	blk_filter_part_add(p, devt);
> +
>  	/* suppress uevent if the disk suppresses it */
>  	if (!dev_get_uevent_suppress(ddev))
>  		kobject_uevent(&pdev->kobj, KOBJ_ADD);
> @@ -552,6 +561,7 @@ int bdev_del_partition(struct block_device *bdev, int partno)
>  		goto out_unlock;
>  
>  	sync_blockdev(bdevp);
> +
>  	invalidate_bdev(bdevp);
>  
>  	delete_partition(bdev->bd_disk, part);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 8ae833e00443..431eae17fd8f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,7 +237,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_HIPRI)
>  		bio_set_polled(&bio, iocb);
>  
> -	qc = submit_bio(&bio);
> +	qc = submit_bio_direct(&bio);
>  	for (;;) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		if (!READ_ONCE(bio.bi_private))
> @@ -400,7 +400,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  				polled = true;
>  			}
>  
> -			qc = submit_bio(bio);
> +			qc = submit_bio_direct(bio);
>  
>  			if (polled)
>  				WRITE_ONCE(iocb->ki_cookie, qc);
> @@ -421,7 +421,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  			atomic_inc(&dio->ref);
>  		}
>  
> -		submit_bio(bio);
> +		submit_bio_direct(bio);
>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>  	}
>  
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 183299892465..d9bb1b6f6814 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -459,7 +459,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>  		sdio->submit_io(bio, dio->inode, sdio->logical_offset_in_bio);
>  		dio->bio_cookie = BLK_QC_T_NONE;
>  	} else
> -		dio->bio_cookie = submit_bio(bio);
> +		dio->bio_cookie = submit_bio_direct(bio);
>  
>  	sdio->bio = NULL;
>  	sdio->boundary = 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c1aafb2ab990..e05f20ce8b5f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -73,7 +73,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  				file_inode(dio->iocb->ki_filp),
>  				iomap, bio, pos);
>  	else
> -		dio->submit.cookie = submit_bio(bio);
> +		dio->submit.cookie = submit_bio_direct(bio);
>  }
>  
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c6d765382926..5b0a32697207 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioprio.h>
>  /* struct bio, bio_vec and BIO_* flags are defined in blk_types.h */
>  #include <linux/blk_types.h>
> +#include <linux/blk-filter.h>
>  
>  #define BIO_DEBUG
>  
> @@ -411,7 +412,8 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
>  	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
>  }
>  
> -extern blk_qc_t submit_bio(struct bio *);
> +extern blk_qc_t submit_bio_direct(struct bio *bio);
> +extern void submit_bio(struct bio *bio);
>  
>  extern void bio_endio(struct bio *);
>  
> diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
> new file mode 100644
> index 000000000000..f3e79e5b4586
> --- /dev/null
> +++ b/include/linux/blk-filter.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * API declarations for kernel modules utilizing block device filters
> + */
> +
> +#ifndef BLK_FILTER_H
> +#define BLK_FILTER_H
> +
> +#ifdef CONFIG_BLK_FILTER
> +#include <linux/kref.h>
> +
> +struct blk_filter_ops {
> +	/*
> +	 * Intercept bio callback.
> +	 *
> +	 * Returns true if the request was intercepted and placed in the
> +	 * queue for processing. Otherwise submit_bio_direct() calling
> +	 * needed.
> +	 */
> +	bool (*filter_bio)(struct bio *bio, void *filter_data);
> +
> +	/*
> +	 * Callback to a request to add block device to the filter.
> +	 *
> +	 * Returns true if the block device will be filtered.
> +	 * p_filter_data gets a pointer to data that is unique to
> +	 * this device.
> +	 */
> +	bool (*part_add)(dev_t devt, void **p_filter_data);
> +
> +	/*
> +	 * Callback to remove block device from the filter.
> +	 */
> +	void (*part_del)(void *filter_data);
> +};
> +
> +struct blk_filter {
> +	struct list_head link;
> +	struct kref kref;
> +	struct blk_filter_ops *ops;
> +};
> +
> +/*
> + * Register/unregister device to filter
> + */
> +void *blk_filter_register(struct blk_filter_ops *ops);
> +
> +void blk_filter_unregister(void *filter);
> +
> +/*
> + * Attach/detach device to filter
> + */
> +int blk_filter_attach(dev_t devt, void *filter, void *filter_data);
> +
> +void blk_filter_detach(dev_t devt);
> +
> +/*
> + * For a consistent state of the file system use the freeze_bdev/thaw_bdav.
> + * But in addition, to ensure that the filter is not in the state of
> + * intercepting the next BIO, you need to call black_filter_freeze/blk_filter_thaw.
> + * This is especially actual if there is no file system on the disk.
> + */
> +
> +void blk_filter_freeze(struct block_device *bdev);
> +
> +void blk_filter_thaw(struct block_device *bdev);
> +
> +/*
> + * Filters intercept function
> + */
> +void blk_filter_submit_bio(struct bio *bio);
> +
> +#endif /* CONFIG_BLK_FILTER */
> +
> +#endif
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 4ab853461dff..514fab6b947e 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -4,7 +4,7 @@
>  
>  /*
>   * 	genhd.h Copyright (C) 1992 Drew Eckhardt
> - *	Generic hard disk header file by  
> + *	Generic hard disk header file by
>   * 		Drew Eckhardt
>   *
>   *		<drew@colorado.edu>
> @@ -75,6 +75,12 @@ struct hd_struct {
>  	int make_it_fail;
>  #endif
>  	struct rcu_work rcu_work;
> +
> +#ifdef CONFIG_BLK_FILTER
> +	struct rw_semaphore filter_rw_lockup; /* for freezing block device*/
> +	struct blk_filter *filter; /* block layer filter*/
> +	void *filter_data; /*specific for each block device filters data*/
> +#endif
>  };
>  
>  /**
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 01e2858b5fe3..5287346b87a1 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -283,7 +283,7 @@ static int hib_submit_io(int op, int op_flags, pgoff_t page_off, void *addr,
>  		bio->bi_end_io = hib_end_io;
>  		bio->bi_private = hb;
>  		atomic_inc(&hb->count);
> -		submit_bio(bio);
> +		submit_bio_direct(bio);
>  	} else {
>  		error = submit_bio_wait(bio);
>  		bio_put(bio);
> diff --git a/mm/page_io.c b/mm/page_io.c
> index e485a6e8a6cd..4540426400b3 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -362,7 +362,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  	count_swpout_vm_event(page);
>  	set_page_writeback(page);
>  	unlock_page(page);
> -	submit_bio(bio);
> +	submit_bio_direct(bio);
>  out:
>  	return ret;
>  }
> @@ -434,7 +434,7 @@ int swap_readpage(struct page *page, bool synchronous)
>  	}
>  	count_vm_event(PSWPIN);
>  	bio_get(bio);
> -	qc = submit_bio(bio);
> +	qc = submit_bio_direct(bio);
>  	while (synchronous) {
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		if (!READ_ONCE(bio->bi_private))
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-10-24 14:53 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-21  9:04 [PATCH 0/2] block layer filter and block device snapshot module Sergei Shtepa
2020-10-21  9:04 ` [PATCH 1/2] Block layer filter - second version Sergei Shtepa
2020-10-21  9:14   ` Johannes Thumshirn
2020-10-21 10:01     ` Sergei Shtepa
2020-10-21  9:21   ` Damien Le Moal
2020-10-21 10:27     ` Sergei Shtepa
2020-10-21 11:44     ` Matthew Wilcox
2020-10-21 12:55       ` Sergei Shtepa
2020-10-21 13:07         ` Matthew Wilcox
2020-10-21 14:35           ` Sergei Shtepa
2020-10-21 15:09   ` Randy Dunlap
2020-10-24 14:53   ` Greg KH
     [not found] ` <1603271049-20681-3-git-send-email-sergei.shtepa@veeam.com>
2020-10-21  9:08   ` [PATCH 2/2] blk-snap - snapshots and change-tracking for block devices Pavel Machek
2020-10-21  9:37     ` Sergei Shtepa
     [not found]   ` <BL0PR04MB65142D9F391FE8777F096EF5E71C0@BL0PR04MB6514.namprd04.prod.outlook.com>
2020-10-21 11:15     ` Sergei Shtepa
2020-10-21 15:11   ` Randy Dunlap
2020-10-21 13:31 ` [PATCH 0/2] block layer filter and block device snapshot module Hannes Reinecke
2020-10-21 14:10   ` Sergei Shtepa
2020-10-22  5:58     ` Hannes Reinecke
2020-10-22  9:44       ` Sergei Shtepa
2020-10-22 10:28         ` Damien Le Moal
2020-10-22 13:52           ` Sergei Shtepa
2020-10-22 15:14             ` Darrick J. Wong
2020-10-22 17:54               ` Mike Snitzer
2020-10-23  9:13                 ` hch
2020-10-23 10:31                   ` Hannes Reinecke
2020-10-23 11:04                     ` Sergei Shtepa
2020-10-23 11:12                     ` [dm-devel] " hch
2020-10-22 18:35 ` Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).