From: Sergei Shtepa <sergei.shtepa@veeam.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "axboe@kernel.dk" <axboe@kernel.dk>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/20] block, blk_filter: enable block device filters
Date: Fri, 8 Jul 2022 12:45:33 +0200	[thread overview]
Message-ID: <aec88137-070e-7c1d-215f-86a1e6d4b10a@veeam.com> (raw)
In-Reply-To: <YscXTGXumE5Ust15@infradead.org>



On 7/7/22 19:26, Christoph Hellwig wrote:
> 
> On Thu, Jul 07, 2022 at 10:26:55AM +0200, Sergei Shtepa wrote:
>> Thank you, Christoph, for your attention to the patch.
>>
>> I am preparing the next version of the patch. In it, I planned to
>> simplify the bdev_filter code.
>> I will make changes in it, in accordance with your comments, and
>> will add your code and check it on my test labs.
>>
>> But I'm not sure if using the blk_mq_freeze_queue() is appropriate.
>> If I understood the code correctly, it is based on the expectation
>> that the counter q->q_usage_counter will decrease to zero.
>> To increase it, a blk_queue_enter() is used. And at the time of
>> calling the filter_bio() in the submit_bio_noacct(), this counter
>> has not yet been increased. I will double check this and try to
>> get rid of the bdev->bd_filter_lock.
> Indeed.  For this to work we'd need to call the filter driver
> later.  Which brings up another question: is there a real
> need to attach the filter driver to the bdev and thus potentially
> a partition?  The rest of the block layer operates on the whole disk
> after the initial partition remapping, and besides allowing the
> filter driver to be called under q_usage_counter, this would
> also clean up some concepts.  It would probably also allow removing
> the repeat return value in favor of just using submit_bio_noacct,
> similar to how normal stacking drivers reinject bios.
> 

Thank you, Christoph.
This is the most crucial question for the entire patch set.
The filtering location imposes restrictions on the filter code and
determines its main algorithm.

1. Work at the partition or disk level?
At the user level, programs operate on block devices.
The "disk" entity really only makes sense at the kernel level.
When users choose which block devices to back up and which not,
they operate with mount points, which are converted into block
devices, i.e. partitions. Therefore, it is better to handle the bio
before it is remapped to the disk.
If filtering is performed after remapping, then we are forced either
to apply the filter to the entire disk, or to complicate the filtering
algorithm by calculating which range of sectors a bio is addressed to.
And what if a bio straddles a partition boundary?
Filtering at the block device level seems to me the simpler solution.
But this is not the biggest problem.

2. Can the filter sleep, or must it postpone bio processing to a worker thread?
The problem is in the implementation of the COW algorithm.
If I submit a bio to read a chunk (one bio) and then pass a write bio,
then with some probability I read partially overwritten data: the write
overtakes the read. The REQ_SYNC and REQ_PREFLUSH flags don't help.
Maybe it's an issue in the disk driver, or the hypervisor, or a NAS,
or a RAID, or maybe it is normal behavior; I don't know. It is also
possible that I am not using the flags correctly. I have seen the
comments on patch 11/20, but I am not sure those fixes will solve this
problem. Because of this, I have to postpone the write until the read
completes.

2.1 The easiest way to solve the problem is to block the writer's thread
with a semaphore, and for a bio with the REQ_NOWAIT flag, to complete
processing with bio_wouldblock_error(). This is the solution currently
being used.
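The approach in 2.1 can be sketched roughly like this (kernel-style pseudocode; the names `filter_write_bio` and `snap_sem` are made up for illustration and are not the identifiers used in the patch):

```
filter_write_bio(bio)
	if (bio->bi_opf & REQ_NOWAIT) {
		if (down_trylock(&snap_sem)) {
			/* would have to sleep: fail with -EAGAIN */
			bio_wouldblock_error(bio);
			return;
		}
	} else {
		/* sleep until the COW read of the chunk completes */
		down(&snap_sem);
	}
	/* ... let the write proceed, then up(&snap_sem) ... */
```

The key point is that the writer sleeps in its own context; no bios are handed off to another thread.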

2.2 Another solution is possible without putting the thread to sleep,
but it requires queueing the write bio to another thread. This solution
is used in the out-of-tree veeamsnap module, and it has performance
issues that I don't like. But when hooking make_request_fn, which was
the mechanism on kernels before 5.10, there was no other choice.

The current implementation, in which filtering is performed before
remapping, allows handling a bio addressed to a partition and allows
putting the writer's thread to sleep. The price is a reference counter
on the filter and a spinlock. It may be possible to do better with RCU;
I haven't tried it yet.
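An RCU-based replacement for the spinlock might look roughly like this (a sketch under the assumption that bdev->bd_filter is an RCU-protected pointer and the filter embeds a kref with a `filter_release` callback; none of this is code from the patch):

```
rcu_read_lock();
flt = rcu_dereference(bdev->bd_filter);
if (flt && !kref_get_unless_zero(&flt->kref))
	flt = NULL;		/* filter is being detached */
rcu_read_unlock();

if (flt) {
	flt->ops->filter_bio(flt, bio);
	kref_put(&flt->kref, filter_release);
}
```

This would keep the hot path lock-free while detach synchronizes via synchronize_rcu() and the final kref_put.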

If the filter is called under the q->q_usage_counter counter, then I
will not be able to execute COW in the context of the current thread
without deadlocks. I will have to use a scheme with an additional
worker thread, and bio filtering will become much more complicated.
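The ordering issue discussed above can be pictured like this (simplified kernel-style pseudocode of the submission path with the patch applied, not the literal upstream code):

```
submit_bio_noacct(bio)
	filter_bio(bio)			/* filter hook runs here:
					   q_usage_counter not yet elevated,
					   so blk_mq_freeze_queue() cannot
					   drain or fence the filter */
	...
	blk_queue_enter(q)		/* q_usage_counter incremented here;
					   blk_mq_freeze_queue() waits for it
					   to drop to zero */
	...
```

Moving the hook below blk_queue_enter() would let freezing fence the filter, but then the filter could no longer sleep waiting for its own COW I/O on the same queue.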

From an architectural point of view, I see the filter as an intermediate
layer between the file system and the block layer. If we push the filter
deep into the block layer, restrictions will be imposed on its use.


Thread overview: 46+ messages
2022-06-13 15:52 [PATCH 00/20] blksnap - creating non-persistent snapshots for backup Sergei Shtepa
2022-06-13 15:52 ` [PATCH 01/20] block, blk_filter: enable block device filters Sergei Shtepa
2022-06-13 21:50   ` Randy Dunlap
2022-06-14  9:19     ` Sergei Shtepa
2022-06-14  9:21     ` Sergei Shtepa
2022-06-14  9:01   ` kernel test robot
2022-07-06 12:59   ` Christoph Hellwig
2022-07-07  8:26     ` Sergei Shtepa
2022-07-07 17:26       ` Christoph Hellwig
2022-07-08 10:45         ` Sergei Shtepa [this message]
2022-07-13 11:56           ` Christoph Hellwig
2022-07-13 13:47             ` Sergei Shtepa
2022-07-14  5:12               ` Christoph Hellwig
2022-07-14  9:22                 ` Sergei Shtepa
2022-06-13 15:52 ` [PATCH 02/20] block, blksnap: header file of the module interface Sergei Shtepa
2022-07-06 13:03   ` Christoph Hellwig
2022-06-13 15:52 ` [PATCH 03/20] block, blksnap: module management interface functions Sergei Shtepa
2022-06-13 22:44   ` Chaitanya Kulkarni
2022-06-13 15:52 ` [PATCH 04/20] block, blksnap: init() and exit() functions Sergei Shtepa
2022-06-13 15:52 ` [PATCH 05/20] block, blksnap: interaction with sysfs Sergei Shtepa
2022-06-13 15:52 ` [PATCH 06/20] block, blksnap: attaching and detaching the filter and handling a bios Sergei Shtepa
2022-06-13 15:53 ` [PATCH 07/20] block, blksnap: map of change block tracking Sergei Shtepa
2022-06-13 15:53 ` [PATCH 08/20] block, blksnap: big buffer in the form of an array of pages Sergei Shtepa
2022-07-06 13:09   ` Christoph Hellwig
2022-06-13 15:53 ` [PATCH 09/20] block, blksnap: minimum data storage unit of the original block device Sergei Shtepa
2022-06-13 15:53 ` [PATCH 10/20] block, blksnap: buffer in memory for the minimum data storage unit Sergei Shtepa
2022-06-13 15:53 ` [PATCH 11/20] block, blksnap: functions and structures for performing block I/O operations Sergei Shtepa
2022-07-07 17:33   ` Christoph Hellwig
2022-06-13 15:53 ` [PATCH 12/20] block, blksnap: storage for storing difference blocks Sergei Shtepa
2022-06-13 15:53 ` [PATCH 13/20] block, blksnap: event queue from the difference storage Sergei Shtepa
2022-06-13 15:53 ` [PATCH 14/20] block, blksnap: owner of information about overwritten blocks of the original block device Sergei Shtepa
2022-06-13 15:53 ` [PATCH 15/20] block, blksnap: snapshot image " Sergei Shtepa
2022-07-06 13:13   ` Christoph Hellwig
2022-07-07  9:16     ` Sergei Shtepa
2022-07-07 17:24       ` Christoph Hellwig
2022-07-08  7:58         ` Sergei Shtepa
2022-07-08  8:04           ` Christoph Hellwig
2022-06-13 15:53 ` [PATCH 16/20] block, blksnap: snapshot Sergei Shtepa
2022-06-13 15:53 ` [PATCH 17/20] block, blksnap: debugging mechanism for monitoring memory consumption Sergei Shtepa
2022-06-13 15:53 ` [PATCH 18/20] block, blksnap: Kconfig Sergei Shtepa
2022-06-13 21:38   ` Randy Dunlap
2022-07-06 13:06   ` Christoph Hellwig
2022-06-13 15:53 ` [PATCH 19/20] block, blksnap: Makefile Sergei Shtepa
2022-07-06 13:06   ` Christoph Hellwig
2022-06-13 15:53 ` [PATCH 20/20] block, blksnap: adds a blksnap to the kernel tree Sergei Shtepa
2022-06-18  2:11   ` kernel test robot
