* dm-thin: Several Questions on dm-thin performance.
@ 2019-11-22  3:14 JeffleXu
  2019-11-22 18:55 ` Joe Thornber
  0 siblings, 1 reply; 13+ messages in thread
From: JeffleXu @ 2019-11-22  3:14 UTC (permalink / raw)
  To: dm-devel

Hi guys,

I have several questions about dm-thin that came up while testing and 
evaluating its IO performance. I would be grateful if someone could spend a 
little time on them.


The first question is: what is the purpose of the data cell? In 
thin_bio_map(), a normal bio will be packed into both a virtual cell and a 
data cell. I can understand that the virtual cell is used to prevent a 
discard bio and a non-discard bio targeting the same block from being 
processed at the same time. I found that it was added in commit 
e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin: fix race between 
simultaneous io and discards to same block), but I'm still confused 
about the use of the data cell.


The second question is about the impact of the virtual cell and data cell on 
IO performance. If $data_block_size is large, for example 1G, then in a 
multithreaded fio test most bios will be buffered in the cell->bios list and 
then processed asynchronously by the worker thread, even when there is no 
discard bio. So IO that was originally parallel is now processed serially by 
the worker thread. As the number of fio test threads increases, the single 
worker thread can easily hit 100% CPU and become the performance bottleneck, 
since the dm-thin workqueue is an ordered (single-threaded) unbound workqueue.

Using an nvme SSD and fio (direct=1, ioengine=libaio, iodepth=128, 
numjobs=4, rw=read, bs=4k), the bandwidth on the bare nvme device is 
1589MiB/s. The bandwidth on the thin device is only 1274MiB/s, while the four 
fio threads run at 200% CPU and the single worker thread is always running 
at 100% CPU. perf of the worker thread shows that process_bio() consumes 86% 
of the time.
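
For reference, the fio job was essentially the following (reconstructed from 
the parameters above; the filename and runtime are placeholders rather than 
the exact values used):

	# Reconstruction of the job described above; filename and runtime
	# are placeholders, everything else matches the parameters quoted.
	[global]
	direct=1
	ioengine=libaio
	iodepth=128
	rw=read
	bs=4k
	runtime=60
	time_based=1

	[thin-read]
	numjobs=4
	# point at the thin device, or at the bare nvme device for the baseline
	filename=/dev/mapper/thin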


Regards

Jeffle Xu


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-11-22  3:14 dm-thin: Several Questions on dm-thin performance JeffleXu
@ 2019-11-22 18:55 ` Joe Thornber
  2019-12-02  7:50   ` JeffleXu
  2019-12-06 14:15   ` Nikos Tsironis
  0 siblings, 2 replies; 13+ messages in thread
From: Joe Thornber @ 2019-11-22 18:55 UTC (permalink / raw)
  To: JeffleXu; +Cc: dm-devel

On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:

> The first question is what's the purpose of data cell? In thin_bio_map(),
> normal bio will be packed as a virtual cell and data cell. I can understand
> that virtual cell is used to prevent discard bio and non-discard bio
> targeting the same block from being processed at the same time. I find it
> was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
> fix race between simultaneous io and discards to same block), but I'm still
> confused about the use of data cell.

As you are aware there are two address spaces for the locks.  The 'virtual' one
refers to cells in the logical address space of the thin devices, and the 'data' one
refers to the underlying data device.  There are certain conditions where we 
unfortunately need to hold both of these (eg, to prevent a data block being reprovisioned
before an io to it has completed).
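
For concreteness, dm-thin builds the two kinds of keys roughly like this
(paraphrased from build_virtual_key()/build_data_key() in dm-thin.c; the
exact code may differ slightly):

	/* Paraphrased from dm-thin.c: the same helper builds a key in either
	 * address space, and thin_bio_map() detains the bio in both. */
	#include "dm-thin-metadata.h"
	#include "dm-bio-prison-v1.h"

	enum lock_space {
		VIRTUAL,
		PHYSICAL
	};

	static void build_key(struct dm_thin_device *td, enum lock_space ls,
			      dm_block_t b, dm_block_t e, struct dm_cell_key *key)
	{
		key->virtual = (ls == VIRTUAL);
		key->dev = dm_thin_dev_id(td);
		key->block_begin = b;
		key->block_end = e;
	}

	/* Lock in the thin device's logical (virtual) block address space... */
	static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
				      struct dm_cell_key *key)
	{
		build_key(td, VIRTUAL, b, b + 1ULL, key);
	}

	/* ...or in the address space of the underlying data device. */
	static void build_data_key(struct dm_thin_device *td, dm_block_t b,
				   struct dm_cell_key *key)
	{
		build_key(td, PHYSICAL, b, b + 1ULL, key);
	}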

> The second question is the impact of virtual cell and data cell on IO
> performance. If $data_block_size is large for example 1G, in multithread fio
> test, most bio will be buffered in cell->bios list and then be processed by
> worker thread asynchronously, even when there's no discard bio. Thus the
> original parallel IO is processed by worker thread serially now. As the
> number of fio test threads increase, the single worker thread can easily get
> CPU 100%, and thus become the bottleneck of the performance since dm-thin
> workqueue is ordered unbound.

Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
new interface that we need to move dm-thin across to use (dm-cache already uses it).
It allows concurrent holders of a cell (ie, read locks), so we'll be able to remap
much more io without handing it off to a worker thread.  Once this is done I want
to add an extra field to cells that will cache the mapping, this way if you acquire a
cell that is already held then you can avoid the expensive btree lookup.  Together 
these changes should make a huge difference to the performance.
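
To give a feel for it: with v2 the map path could take a shared lock and
remap in place, only deferring to the worker on contention.  A sketch,
modelled on bio_detain_shared() in dm-cache-target.c (the pool fields, the
alloc/free helpers and the lock levels below are assumptions, not existing
dm-thin code):

	/* Sketch only: take a shared bio-prison-v2 lock in the map path and
	 * remap in place, deferring to the worker only on contention.  The
	 * struct pool members and helpers here are hypothetical. */
	#include <linux/bio.h>
	#include "dm-bio-prison-v2.h"

	#define READ_LOCK_LEVEL		0
	#define WRITE_LOCK_LEVEL	1

	static bool thin_detain_shared(struct pool *pool, struct thin_c *tc,
				       dm_block_t virt_block, struct bio *bio,
				       struct dm_bio_prison_cell_v2 **cell_out)
	{
		struct dm_cell_key_v2 key;
		struct dm_bio_prison_cell_v2 *prealloc, *cell;
		unsigned level = bio_data_dir(bio) == WRITE ?
				 WRITE_LOCK_LEVEL : READ_LOCK_LEVEL;

		/* No allocation allowed here, so this must come from a mempool. */
		prealloc = alloc_prison_cell(pool);		/* hypothetical */

		/* Key on one block in this thin device's virtual address space. */
		key.virtual = 1;
		key.dev = dm_thin_dev_id(tc->td);
		key.block_begin = virt_block;
		key.block_end = virt_block + 1ULL;

		/*
		 * Returns true if the shared lock was granted; otherwise the bio
		 * has been parked in the cell and will be handed back once the
		 * exclusive holder unlocks.
		 */
		if (!dm_cell_get_v2(pool->prison_v2, &key, level, bio, prealloc, &cell)) {
			free_prison_cell(pool, prealloc);	/* hypothetical */
			return false;
		}

		if (cell != prealloc)
			free_prison_cell(pool, prealloc);

		*cell_out = cell;	/* dropped with dm_cell_put_v2() when the bio completes */
		return true;
	}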

If you've got some spare coding cycles I'd love some help with this ;)

- Joe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-11-22 18:55 ` Joe Thornber
@ 2019-12-02  7:50   ` JeffleXu
  2019-12-02 22:26     ` Eric Wheeler
  2019-12-06 14:15   ` Nikos Tsironis
  1 sibling, 1 reply; 13+ messages in thread
From: JeffleXu @ 2019-12-02  7:50 UTC (permalink / raw)
  To: dm-devel

Thanks for the reply and the explanation.

Anyway, it sounds like a significant amount of work to convert dm-thin over 
to dm-bio-prison-v2.


Regards

Jeffle


On 2019/11/23 2:55 AM, Joe Thornber wrote:

> On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:
>
>> The first question is what's the purpose of data cell? In thin_bio_map(),
>> normal bio will be packed as a virtual cell and data cell. I can understand
>> that virtual cell is used to prevent discard bio and non-discard bio
>> targeting the same block from being processed at the same time. I find it
>> was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
>> fix race between simultaneous io and discards to same block), but I'm still
>> confused about the use of data cell.
> As you are aware there are two address spaces for the locks.  The 'virtual' one
> refers to cells in the logical address space of the thin devices, and the 'data' one
> refers to the underlying data device.  There are certain conditions where we
> unfortunately need to hold both of these (eg, to prevent a data block being reprovisioned
> before an io to it has completed).
>
>> The second question is the impact of virtual cell and data cell on IO
>> performance. If $data_block_size is large for example 1G, in multithread fio
>> test, most bio will be buffered in cell->bios list and then be processed by
>> worker thread asynchronously, even when there's no discard bio. Thus the
>> original parallel IO is processed by worker thread serially now. As the
>> number of fio test threads increase, the single worker thread can easily get
>> CPU 100%, and thus become the bottleneck of the performance since dm-thin
>> workqueue is ordered unbound.
> Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
> new interface that we need to move dm-thin across to use (dm-cache already uses it).
> It allows concurrent holders of a cell (ie, read locks), so we'll be able to remap
> much more io without handing it off to a worker thread.  Once this is done I want
> to add an extra field to cells that will cache the mapping, this way if you acquire a
> cell that is already held then you can avoid the expensive btree lookup.  Together
> these changes should make a huge difference to the performance.
>
> If you've got some spare coding cycles I'd love some help with this ;)
>
> - Joe
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-02  7:50   ` JeffleXu
@ 2019-12-02 22:26     ` Eric Wheeler
  2019-12-03 12:51       ` Joe Thornber
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wheeler @ 2019-12-02 22:26 UTC (permalink / raw)
  To: Joe Thornber, JeffleXu; +Cc: dm-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4267 bytes --]

On Mon, 2 Dec 2019, JeffleXu wrote:
> On 2019/11/23 2:55 AM, Joe Thornber wrote:
> 
> > On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:
> >
> > > The first question is what's the purpose of data cell? In thin_bio_map(),
> > > normal bio will be packed as a virtual cell and data cell. I can
> > > understand
> > > that virtual cell is used to prevent discard bio and non-discard bio
> > > targeting the same block from being processed at the same time. I find it
> > > was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
> > > fix race between simultaneous io and discards to same block), but I'm
> > > still
> > > confused about the use of data cell.
> > As you are aware there are two address spaces for the locks.  The 'virtual'
> > one
> > refers to cells in the logical address space of the thin devices, and the
> > 'data' one
> > refers to the underlying data device.  There are certain conditions where we
> > unfortunately need to hold both of these (eg, to prevent a data block being
> > reprovisioned
> > before an io to it has completed).
> >
> > > The second question is the impact of virtual cell and data cell on IO
> > > performance. If $data_block_size is large for example 1G, in multithread
> > > fio
> > > test, most bio will be buffered in cell->bios list and then be processed
> > > by
> > > worker thread asynchronously, even when there's no discard bio. Thus the
> > > original parallel IO is processed by worker thread serially now. As the
> > > number of fio test threads increase, the single worker thread can easily
> > > get
> > > CPU 100%, and thus become the bottleneck of the performance since dm-thin
> > > workqueue is ordered unbound.
> > Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
> > new interface that we need to move dm-thin across to use (dm-cache already
> > uses it).
> > It allows concurrent holders of a cell (ie, read locks), so we'll be able to
> > remap
> > much more io without handing it off to a worker thread.  Once this is done I
> > want
> > to add an extra field to cells that will cache the mapping, this way if you
> > acquire a
> > cell that is already held then you can avoid the expensive btree lookup.
> > Together
> > these changes should make a huge difference to the performance.
> >
> > If you've got some spare coding cycles I'd love some help with this ;)

Hi Joe,

I'm not sure if I will have the time, but I thought I would start the 
research and ask a few questions. I looked at the v1/v2 .h files; some 
of the functions just gain a _v2 suffix, perhaps with calling-convention 
or structure-field changes.

However, there appear to be some design changes, too:

* dm_deferred_set - These appear to be used a bit in dm-thin.c.  
The dm_deferred_set calls don't seem to reference anything prison-related, 
but they are defined in dm-bio-prison-v1.h.  Can you provide direction on 
how these would be refactored, or if they can just remain as-is?
  Call counts in dm-thin.c:
      2 dm_deferred_entry_dec
      2 dm_deferred_set_create
      3 dm_deferred_entry_inc
      3 dm_deferred_set_add_work
      4 dm_deferred_set_destroy

* dm_bio_detain - is this replaced by dm_cell_get_v2?
	- It looks like dm_bio_detain() returns 1 if already held, but 
	  dm_cell_get_v2() returns true if the lock is granted.  How might 
	  this be handled?
	- What are the lock_levels?
	- What in dm-thin.c would then call dm_cell_put_v2?

* dm_cell_release(_no_holder) - is this replaced by dm_cell_unlock_v2?
	- How would the _no_holder version be refactored?

* dm_cell_visit_release - This function uses a callback, but none of the 
v2 functions have such a callback.  Do we need to write a helper function 
with get/unlock(?) around the cell?


* dm_cell_error - no equivalent v2 implementation.  Suggestions?


What other considerations might there be in the v2 port?

Thanks!

--
Eric Wheeler


> >
> > - Joe
> >
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-02 22:26     ` Eric Wheeler
@ 2019-12-03 12:51       ` Joe Thornber
  2019-12-04 12:11         ` Joe Thornber
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Thornber @ 2019-12-03 12:51 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: JeffleXu, dm-devel

On Mon, Dec 02, 2019 at 10:26:00PM +0000, Eric Wheeler wrote:

> Hi Joe,
> 
> I'm not sure if I will have the time but thought I would start the 
> research and ask a few questions. I looked at the v1/v2 .h files and some 
> of the functions just change suffix to _v2 and maybe calling 
> convention/structure field changes.
> 
> However, there appear to be some design changes, too:

Yes, the interface is different, and it's really not trivial to switch dm-thin
over to use it (otherwise I'd have done it already).  dm-cache already uses
the new interface, which could be used as a guide, especially if you look at the patches
that made the switch.

I'm going to write up some notes over the next couple of days, which I'll post on this thread.

- Joe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-03 12:51       ` Joe Thornber
@ 2019-12-04 12:11         ` Joe Thornber
  2019-12-05 23:14           ` Eric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Thornber @ 2019-12-04 12:11 UTC (permalink / raw)
  To: Eric Wheeler, JeffleXu, dm-devel

(These notes are for my own benefit as much as anything; I haven't
worked on this for a couple of years and will forget it all completely
if I don't write it down somewhere.)

Let's start by writing some pseudocode for what the remap function for
thin provisioning actually does.

	----------------------------------------------------------
	-- Metadata

	newtype ThinId = Int
	data Bio = Bio
	newtype VBlock = Integer	-- virtual block
	newtype DBlock = Integer	-- data block

	data LookupResult =
	    Unprovisioned |
	    Provisioned { lrDBlock :: DataBlock,
	                  lrShared :: Bool
	                }

	metadataLookup :: ThinId -> VBlock -> Task LookupResult
	metadataLookup = undefined

	metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
	metadataInsert = undefined

	metadataRemove :: ThinId -> VBlock -> Task ()
	metadataRemove = undefined

	-- Blocks all other metadata operations while running
	metadataCommit :: Task ()
	metadataCommit = undefined

	----------------------------------------------------------
	-- Tasks

	-- Many steps of servicing a bio can block.  eg, taking a lock,
	-- reading metadata, updating metadata, zeroing data, copying
	-- data ...
	-- So we completely side step the issue in this pseudocode by
	-- running everything in a magic light weight thread.
	spawn :: Task () -> IO ()
	spawn = undefined

	----------------------------------------------------------
	-- Locking

	-- These 'with' primitives acquire a lock (can block of course), perform
	-- an action, and then automatically release the lock.
	 
	-- Shared lock can be upgraded, so we need to pass the lock into
	-- the action body.
	withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
	withSharedLock thinId vblock actionFn = undefined

	withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
	withExclusiveLock thinId vblock action = undefined

	-- This promotes a shared lock to exclusive.
	withUpgradedLock :: Lock -> Task () -> Task ()
	withUpgradedLock lock action = undefined

	-- Data locks are always exclusive
	withDataLock :: DBlock -> Task () -> Task ()
	withDataLock dblock action = undefined

	----------------------------------------------------------

	-- Top level remap function.  Kicks off a green thread for each bio.
	-- How we handle a bio depends on whether it's a read, write, discard
	-- or flush bio.  Whether the block is already provisioned, and if so
	-- whether it is shared between snapshots.
	remap :: ThinId -> Bio -> IO ()
	remap thinId bio = spawn $ remapFn thinId bio vblock
	    where
	        vblock = virtBlock bio
	        remapFn = case classifyBio bio of
	            ReadBio -> remapRead
	            WriteBio -> remapWrite
	            DiscardBio -> remapDiscard
	            FlushBio -> remapFlush

	----------------------------------------------------------

	remapRead :: ThinId -> Bio -> VBlock -> Task ()
	remapRead thinId bio vblock = do
	    withSharedLock thinId vblock $ \_ -> do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Read, Unprovisioned, Shared/!Shared
	            Unprovisioned -> do
	                fillWithZeroes bio
	                complete bio Success

	            -- Read, Provisioned, Shared/!Shared
	            (Provisioned dblock _) ->
	                remapAndWait bio dblock

	----------------------------------------------------------

	remapWrite :: ThinId -> Bio -> VBlock -> Task ()
	remapWrite thinId bio vblock = do
	    withSharedLock thinId vblock $ \lock -> do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Write, Unprovisioned
	            Unprovisioned ->
	                withUpgradedLock lock $
	                    provision thinId bio vblock

	            -- Write, Provisioned, !Shared
	            (Provisioned dblock False) ->
	                remapAndWait bio dblock

	            -- Write, Provisioned, Shared
	            (Provisioned dblock True) ->
	                withUpgradedLock lock $
	                    breakSharing thinId bio vblock dblock

	breakSharing :: ThinId -> Bio -> VBlock -> DataBlock -> Task ()
	breakSharing thinId bio vblock dblockOld = do
	    ab <- allocateBlock
	    case ab of
	        NoDataSpace ->
	            complete bio Failure

	        (Allocated dblockNew) -> do
	            withDataLock dblockOld $		-- we grab data locks to avoid races with discard
	                withDataLock dblockNew $ do
	                    copy dblockOld dblockNew
	                    metadataInsert thinId vblock dblockNew
	            remapAndWait bio dblockNew

	provision :: ThinId -> Bio -> VBlock -> Task ()
	provision thinId bio vblock = do
	    ab <- allocateBlock
	    case ab of
	        NoDataSpace ->
	            complete bio Failure

	        (Allocated dblock) ->
	            withDataLock dblock $ do
	                metadataInsert thinId vblock dblock
	                remapAndWait bio dblock
	            
	----------------------------------------------------------

	discard :: ThinId -> Bio -> VBlock -> Task ()
	discard thinId bio vblock = do
	    withExclusiveLock thinId vblock $ do
	        lr <- metadataLookup thinId vblock
	        case lr of
	            -- Discard, Unprovisioned
	            Unprovisioned ->
	                complete bio Success

	            -- Discard, Provisioned, !Shared
	            (Provisioned dblock False) ->
	                withDataLock dblock $ do
	                    remapAndWait bio dblock  		-- passdown
	                    metadataRemove thinId vblock

	            -- Discard, Provisioned, Shared
	            (Provisioned dblock True) ->
	                withDataLock dblock $ do
	                    metadataRemove thinId vblock
	                    complete bio Success

	----------------------------------------------------------

	flush :: Task ()
	flush = metadataCommit
	    
	----------------------------------------------------------

	remapAndWait :: Bio -> DataBlock -> Task ()
	remapAndWait bio dblock = do
	    remap bio dblock
	    issue bio
	    wait bio
   
The above is a simplification (eg, discards can cover more than a single
block, the pool has multiple modes like OUT_OF_DATA_SPACE).  But it gives
a good idea of what the dm target needs to do, and in a succinct manner.

Now dm-thin.c is anything but succinct, for a couple of reasons:

- Because of where we are in the IO stack we cannot allocate memory.
  This means we either use memory preallocated via mempools, or allocate
  a fixed size block before a bio is processed.

- We don't have a magic green threads library that hides the numerous
  blocking operations that we need.  Instead we have low level facilities
  like workqueues etc.  This tends to have the effect of breaking up the logic
  and scattering it across lots of little completion functions.


How we handle blocking, locking, and quiescing IO are all intertwined.
Which is why switching over to the new bio_prison interface is going to
involve an awful lot of churn.

In the upstream code
====================

- Locking

  The locks provided by bio_prison (v1) are all exclusive locks.  As such
  we take pains to hold them for as short a period as possible.  This means
  holding them for the duration of an IO is completely out of the question.
  Nonetheless, as pointed out in the original post for this thread, this
  can cause bad lock contention, especially if the data block size is large.

- Quiescing

  Because we do not hold locks for the lifetime of the bios, we need
  another way of tracking IO and quiescing regions.  This is what the
  deferred_set component does.  Effectively it divides time up into
  bins, and keeps a reference count of how many IOs are still in flight
  for each bin.  To quiesce we grab a lock, and then wait for all bins
  before this lock was acquired to drain.  Advantages of this approach
  are that it uses very little memory (I think we're currently running with
  64 bins), and consumes v. little cpu.  But we're never specific about
  which region we're waiting to quiesce, instead always waiting for all
  IO older than a certain point to drain.  So we are certainly introducing
  more latency here.
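
  As a toy, userland illustration of the idea (just the concept of bins plus
  per-bin reference counts, not the kernel dm_deferred_set API):

	/* Toy illustration of the deferred_set idea: time is chopped into a
	 * small ring of bins, each in-flight IO holds a reference on the bin
	 * that was current when it entered, and quiescing means waiting for
	 * every older bin to drain.  Concept only, not the kernel code. */
	#include <stdio.h>

	#define NR_BINS 64

	struct deferred_set {
		unsigned current_bin;		/* bin that new IOs join */
		unsigned in_flight[NR_BINS];	/* reference count per bin */
	};

	/* An IO entering the target takes a reference on the current bin... */
	static unsigned ds_inc(struct deferred_set *ds)
	{
		ds->in_flight[ds->current_bin]++;
		return ds->current_bin;
	}

	/* ...and drops it when it completes. */
	static void ds_dec(struct deferred_set *ds, unsigned bin)
	{
		ds->in_flight[bin]--;
	}

	/* Quiesced: has everything issued before 'bin' drained?  The kernel
	 * version parks work on a list and re-checks as counts hit zero. */
	static int ds_quiesced_before(struct deferred_set *ds, unsigned bin)
	{
		for (unsigned b = 0; b != bin; b = (b + 1) % NR_BINS)
			if (ds->in_flight[b])
				return 0;
		return 1;
	}

	int main(void)
	{
		struct deferred_set ds = { .current_bin = 0 };
		unsigned io1 = ds_inc(&ds);	/* old IO starts */
		ds.current_bin = 1;		/* time moves on */
		unsigned io2 = ds_inc(&ds);	/* newer IO starts */

		printf("quiesced before bin 1? %d\n", ds_quiesced_before(&ds, 1)); /* 0 */
		ds_dec(&ds, io1);
		printf("quiesced before bin 1? %d\n", ds_quiesced_before(&ds, 1)); /* 1 */
		ds_dec(&ds, io2);
		return 0;
	}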

- Blocking

  A single thread services any operations that could block.
  When there is work for this thread to perform a work queue item
  is queued (do_worker()).  This then examines linked lists of work
  (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
  processes each list as a batch.  Batching like this is a mixed blessing;
  it allows us to sort incoming bios so we can process bios to the same
  region at the same time, but it is also v. bad for max latency, as we
  have no idea which piece of work was there the longest.

Next iteration of the code
========================== 

- Locking

  bio_prison (v2) provides shared locks, and custom lock levels.  So,
  at the expense of memory, we can hold shared locks for long periods
  that cover the lifetime of the bio.  Acquiring a lock now blocks.

- Quiescing

  Because we hold the locks for long periods we can now do away with the
  deferred set completely.  If you want to quiesce a region, just grab
  the exclusive lock associated with it, when it's finally granted you
  know it's also quiesced.
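
  Concretely, the pattern looks roughly like this (simplified from
  mg_lock_writes()/quiesce() in dm-cache-target.c; check the exact return
  convention of dm_cell_lock_v2() against dm-bio-prison-v2.c):

	/* Sketch of the v2 exclusive-lock + quiesce pattern, simplified from
	 * dm-cache-target.c.  'continuation' is whatever work should run once
	 * every shared holder of the cell has drained. */
	#include <linux/workqueue.h>
	#include "dm-bio-prison-v2.h"

	static int lock_and_quiesce(struct dm_bio_prison_v2 *prison,
				    struct dm_cell_key_v2 *key, unsigned lock_level,
				    struct dm_bio_prison_cell_v2 *prealloc,
				    struct dm_bio_prison_cell_v2 **cell,
				    struct work_struct *continuation)
	{
		/* < 0: someone else holds the exclusive lock; 0: granted with no
		 * shared holders; > 0: granted, but shared holders must drain. */
		int r = dm_cell_lock_v2(prison, key, lock_level, prealloc, cell);
		if (r < 0)
			return r;

		if (r)
			/* Queue 'continuation' once the region has quiesced. */
			dm_cell_quiesce_v2(prison, *cell, continuation);
		else
			/* Nothing in flight: already quiesced, run it now. */
			queue_work(system_wq, continuation);

		return 0;
	}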

- Blocking

  I want to move away from the idea of a single worker function that
  has different categories of work stored for it in different lists.
  Instead, we would store specific work structs on the work queue for each bio.
  Partly this is to reduce latency by increasing 'fairness'.  But also
  the fact that acquiring a lock now blocks means there are a lot more
  block operations to handle, and we'd just end up with a lot of these
  lists of work.  It would also allow us to have multiple kernel threads
  servicing the workqueue.

  If you look at dm-cache-target.c you'll see this has already been
  done for that target.  We have continuation structs that represent
  the work to be performed after the current blocking op has completed.
  dm-cache uses this for migrations, which have a much simpler state model
  than dm-thin.  Even so there are a lot of these little continuation
  functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
  mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
  mg_success, mg_complete).
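
  The core of that pattern is small (lightly abridged from dm-cache-target.c):

	/* A continuation is just a work_struct plus the status of the previous
	 * blocking step, so "the rest of the function" can be queued to run
	 * when that step completes. */
	#include <linux/workqueue.h>
	#include <linux/blk_types.h>

	struct continuation {
		struct work_struct ws;
		blk_status_t input;
	};

	static inline void init_continuation(struct continuation *k,
					     void (*fn)(struct work_struct *))
	{
		INIT_WORK(&k->ws, fn);
		k->input = 0;
	}

	static inline void queue_continuation(struct workqueue_struct *wq,
					      struct continuation *k)
	{
		queue_work(wq, &k->ws);
	}

	/* Each stage (mg_lock_writes, mg_copy, mg_update_metadata, ...) is then
	 * a work function that does one non-blocking chunk and either completes
	 * the operation or arms the next continuation. */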


Where are we now?
=================

I did a lot of work on this a couple of years ago.  First I just dove
in, trying to code things up by hand.  But it quickly devolved into a
maze of badly named continuation functions, all alike.  It's very hard
to give these functions meaningful names; go through the pseudocode at
the top of this email and for each place where we could block, try to
describe where we are.  The biggest problem is that as we introduce more of
these continuations, the big-picture logic recedes and it becomes v. hard to
reason about the code.

I then experimented with automatically generating all the code from a
simpler specification (I used a lispy version of the pseudocode above).
This showed promise and I got it generating kernel code that would
compile.  I was debugging this when I got dragged onto other work,
and this has stagnated since.


So that's where we are.
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-04 12:11         ` Joe Thornber
@ 2019-12-05 23:14           ` Eric Wheeler
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Wheeler @ 2019-12-05 23:14 UTC (permalink / raw)
  To: Joe Thornber; +Cc: JeffleXu, dm-devel

Thanks Joe, great writeup.  Maybe this should go in thin-provisioning.txt.

More below:

On Wed, 4 Dec 2019, Joe Thornber wrote:

> (These notes are for my own benefit as much as anything, I haven't
> worked on this for a couple of years and will forget it all completely
> if I don't write it down somewhere).
> 
> Let's start by writing some pseudocode for what the remap function for
> thin provisioning actually does.
> 
> 	----------------------------------------------------------
> 	-- Metadata
> 
> 	newtype ThinId = Int
> 	data Bio = Bio
> 	newtype VBlock = Integer	-- virtual block
> 	newtype DBlock = Integer	-- data block

Can you define virtual block vs data block?  Is this just the thin volume 
offset vs the tdata offset?

> 
> 	data LookupResult =
> 	    Unprovisioned |
> 	    Provisioned { lrDBlock :: DataBlock,
> 	                  lrShared :: Bool
> 	                }

What is "lr"? as in lrDBlock?

> 	metadataLookup :: ThinId -> VBlock -> Task LookupResult
> 	metadataLookup = undefined
> 
> 	metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
> 	metadataInsert = undefined
> 
> 	metadataRemove :: ThinId -> VBlock -> Task ()
> 	metadataRemove = undefined
> 
> 	-- Blocks all other metadata operations while running
> 	metadataCommit :: Task ()
> 	metadataCommit = undefined
> 
> 	----------------------------------------------------------
> 	-- Tasks
> 
> 	-- Many steps of servicing a bio can block.  eg, taking a lock,
> 	-- reading metadata, updating metadata, zeroing data, copying
> 	-- data ...
> 	-- So we completely side step the issue in this pseudocode by
> 	-- running everything in a magic light weight thread.
> 	spawn :: Task () -> IO ()
> 	spawn = undefined
> 
> 	----------------------------------------------------------
> 	-- Locking
> 
> 	-- These 'with' primitives acquire a lock (can block of course), perform
> 	-- an action, and then automatically release the lock.
> 	 
> 	-- Shared lock can be upgraded, so we need to pass the lock into
> 	-- the action body.
> 	withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
> 	withSharedLock thinId vblock actionFn = undefined
> 
> 	withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
> 	withExclusiveLock thinId vblock action = undefined
> 
> 	-- This promotes a shared lock to exclusive.
> 	withUpgradedLock :: Lock -> Task () -> Task ()
> 	withUpgradedLock lock action = undefined
> 
> 	-- Data locks are always exclusive
> 	withDataLock :: DBlock -> Task () -> Task ()
> 	withDataLock dblock action = undefined
> 
> 	----------------------------------------------------------
> 
> 	-- Top level remap function.  Kicks off a green thread for each bio.
> 	-- How we handle a bio depends on whether it's a read, write, discard
> 	-- or flush bio.  Whether the block is already provisioned, and if so
> 	-- whether it is shared between snapshots.
> 	remap :: ThinId -> Bio -> IO ()
> 	remap thinId bio = spawn $ remapFn thinId bio vblock
> 	    where
> 	        vblock = virtBlock bio
> 	        remapFn = case classifyBio bio of
> 	            ReadBio -> remapRead
> 	            WriteBio -> remapWrite
> 	            DiscardBio -> remapDiscard
> 	            FlushBio -> remapFlush
> 
> 	----------------------------------------------------------
> 
> 	remapRead :: ThinId -> Bio -> VBlock -> Task ()
> 	remapRead thinId bio vblock = do
> 	    withSharedLock thinId vblock $ \_ -> do
> 	        lr <- metadataLookup thinId vblock
> 	        case lr of
> 	            -- Read, Unprovisioned, Shared/!Shared
> 	            Unprovisioned -> do
> 	                fillWithZeroes bio
> 	                complete bio Success
> 
> 	            -- Read, Provisioned, Shared/!Shared
> 	            (Provisioned dblock _) ->
> 	                remapAndWait bio dblock
> 
> 	----------------------------------------------------------
> 
> 	remapWrite :: ThinId -> Bio -> VBlock -> Task ()
> 	remapWrite thinId bio vblock = do
> 	    withSharedLock thinId vblock $ \lock -> do
> 	        lr <- metadataLookup thinId vblock
> 	        case lr of
> 	            -- Write, Unprovisioned
> 	            Unprovisioned ->
> 	                withUpgradedLock lock $
> 	                    provision thinId bio vblock
> 
> 	            -- Write, Provisioned, !Shared
> 	            (Provisioned dblock False) ->
> 	                remapAndWait bio dblock
> 
> 	            -- Write, Provisioned, Shared
> 	            (Provisioned dblock True) ->
> 	                withUpgradedLock lock $
> 	                    breakSharing thinId bio vblock dblock
> 
> 	breakSharing :: ThinId -> Bio -> VBlock -> DataBlock -> Task ()
> 	breakSharing thinId bio vblock dblockOld = do
> 	    ab <- allocateBlock
> 	   case ab of
> 	       NoDataSpace ->
> 	           complete bio Failure
> 
> 	       (Allocated dblockNew) ->
> 	           withDataLock dblockOld $		-- we grab data locks to avoid races with discard
> 	               withDataLock dblockNew $ do
> 	                   copy dblockOld dblockNew
> 	                   metadataInsert thinId vblock dblockNew
> 	           remapAndWait thinId bio dblockNew
> 
> 	provision :: ThinId -> Bio -> VBlock -> Task ()
> 	provision thinId bio vblock = do
> 	    case allocateBlock of
> 	        NoDataSpace ->
> 	            complete bio Failure
> 
> 	        (Allocated dblock) ->
> 	            withDataLock dblock $ do
> 	                metadataInsert thinId vblock dblock
> 	                remapAndWait thinId bio dblock

Does the allocator block?  If so, it would be neat to pre-allocate some 
number of blocks during metadata idle times.  There could be a hidden thin 
volume (ie, devid #16777215) that blocks are allocated into and then 
stolen from for use elsewhere.  The blocks could be pre-zeroed, too!

> 	            
> 	----------------------------------------------------------
> 
> 	discard :: ThinId -> Bio -> VBlock -> Task ()
> 	discard thinId bio vblock = do
> 	    withExclusiveLock thinId vblock $ do
> 	        lr <- metadataLookup thinId vblock
> 	        case lr of
> 	            -- Discard, Unprovisioned
> 	            Unprovisioned ->
> 	                complete bio Success
> 
> 	            -- Discard, Provisioned, !Shared
> 	            (Provisioned dblock False) ->
> 	                withDataLock dblock $ do
> 	                    remapAndWait bio dblock  		-- passdown
> 	                    metadataRemove thinId dblock
> 
> 	            -- Discard, Provisioned, Shared
> 	           (Provisioned dblock True) ->
> 	               withDataLock dblock $ do
> 	                   metadataRemove thinId dblock
> 	                   complete bio Success
> 
> 	----------------------------------------------------------
> 
> 	flush :: Task ()
> 	flush = metadataCommit
> 	    
> 	----------------------------------------------------------
> 
> 	remapAndWait :: Bio -> DataBlock -> Task ()
> 	remapAndWait bio dblock = do
> 	    remap bio dblock
> 	    issue bio
> 	    wait bio
>    
> The above is a simplification (eg, discards can cover more than a single
> block, the pool has multiple modes like OUT_OF_DATA_SPACE).  But it gives
> a good idea of what the dm target needs to do, and in a succinct manner.
> 
> Now dm-thin.c is anything but succinct, for a couple of reasons:
> 
> - Because of where we are in the IO stack we cannot allocate memory.
>   This means we either use memory preallocated via mempools, or allocate
>   a fixed size block before a bio is processed.
> 
> - We don't have a magic green threads library that hides the numerous
>   blocking operations that we need.  Instead we have low level facilities
>   like workqueues etc.  This tends to have the effect of breaking up the logic
>   and scattering it across lots of little completion functions.
> 
> 
> How we handle blocking, locking, and quiescing IO are all intertwined.
> Which is why switching over to the new bio_prison interface is going to
> involve an awful lot of churn.
> 
> In the upstream code
> ====================
> 
> - Locking
> 
>   The locks provided by bio_prison (v1), are all exclusive locks.  As such
>   we take pains to hold them for as short a period as possible.  This means
>   holding them for the duration of an IO is completely out of the question.
>   Nonetheless, as pointed out in the original post for this thread, this
>   can cause bad lock contention, especially if the data block size is large.
> 
> - Quiescing
> 
>   Because we do not hold locks for the lifetime of the bios, we need
>   another way of tracking IO and quiescing regions.  This is what the
>   deferred_set component does.  Effectively it divides time up into
>   bins, and keeps a reference count of how many IOs are still in flight
>   for each bin.  To quiesce we grab a lock, and then wait for all bins
>   before this lock was acquired to drain.  Advantages of this approach
>   is it uses very little memory (I think we're currently running with
>   64 bins), and consumes v. little cpu.  But we're never specific about

curious, is "v." short for "very" (not "versus")?

>   which region we're waiting to quiesce, instead always waiting for all
>   IO older than a certain point to drain.  So we are certainly introducing
>   more latency here.
> 
> - Blocking
> 
>   A single thread services any operations that could block.
>   When there is work for this thread to perform a work queue item
>   is queued (do_worker()).  This then examines linked lists of work
>   (prepared_mappings, discard_pt1, discard_pt2, prefetches etc), and
>   processes each list as a batch.  Batching like this is a mixed blessing;
>   it allows us to sort incoming bios so we can process bios to the same
>   region at the same time, but it is also v. bad for max latency, as we
>   have no idea which piece of work was there the longest.
> 
> Next iteration of the code
> ========================== 
> 
> - Locking
> 
>   bio_prison (v2) provides shared locks, and custom lock levels.  So,
>   at the expense of memory, we can hold shared locks for long periods
>   that cover the lifetime of the bio.  Acquiring a lock now blocks.
> 
> - Quiescing
> 
>   Because we hold the locks for long periods we can now do away with the
>   deferred set completely.  If you want to quiesce a region, just grab
>   the exclusive lock associated with it, when it's finally granted you
>   know it's also quiesced.
> 

good to know.

> - Blocking
> 
>   I want to move away from the idea of a single worker function that
>   has different categories of work stored for it in different lists.
>   Instead, storing specific work structs on the work queue for each bio.
>   Partly this is to reduce latency by increasing 'fairness'.  But also
>   the fact that acquiring a lock now blocks means there are a lot more
>   block operations to handle, and we'd just end up with a lot of these
>   lists of work.  It would also allow us to have multiple kernel threads
>   servicing the workqueue.
> 
>   If you look at dm-cache-target.c you'll see this has already been
>   done for that target.  We have continuation structs that represent
>   the work to be performed after the current blocking op has completed.
>   dm-cache uses this for migrations, which have a much simpler state model
>   than dm-thin.  Even so there are a lot of these little continuation
>   functions (eg, mg_start, mg_lock_writes, mg_copy, mg_full_copy,
>   mg_upgrade_lock, mg_update_metadata_after_copy, mg_update_metadata,
>   mg_success, mg_complete).
> 
> 
> Where are we now?
> =================
> 
> I did a lot of work on this a couple of years ago.  First I just dove
> in, trying to code things up by hand.  But it quickly devolved into a
> maze of badly named continuation functions, all alike.  It's very hard
> to give these functions meaningful names; go through the pseudocode at
> the top of this email and for each place where we could block, try to
> describe where we are.  The biggest problem is as we introduce more of
> these continuations big picture logic receeds and it becomes v. hard to
> reason about the code.

Event-driven continuation functions seem to pop up frequently in the Linux 
kernel.  It would be neat if there were a framework to write these 
procedurally.  Macros might help, but it could still be pretty ugly.  Almost 
needs GCC support.
 
> I then experimented with automatically generating all the code from a
> simpler specification (I used a lispy version of the pseudocode above).
> This showed promise and I got it generating kernel code that would
> compile.  I was debugging this when I got dragged onto other work,
> and this has stagnated since.

Do you think this is the best way to proceed?  Someone with a Lisp 
background would need to help. It might generate first-pass code, but it 
would be difficult to maintain as kernel changes patch the auto-generated code.

-Eric

> 
> 
> So that's where we are.
>  
> 
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-11-22 18:55 ` Joe Thornber
  2019-12-02  7:50   ` JeffleXu
@ 2019-12-06 14:15   ` Nikos Tsironis
  2019-12-13  1:32     ` Eric Wheeler
  1 sibling, 1 reply; 13+ messages in thread
From: Nikos Tsironis @ 2019-12-06 14:15 UTC (permalink / raw)
  To: thornber; +Cc: JeffleXu, dm-devel

On 11/22/19 8:55 PM, Joe Thornber wrote:
> On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:
> 
>> The first question is what's the purpose of data cell? In thin_bio_map(),
>> normal bio will be packed as a virtual cell and data cell. I can understand
>> that virtual cell is used to prevent discard bio and non-discard bio
>> targeting the same block from being processed at the same time. I find it
>> was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
>> fix race between simultaneous io and discards to same block), but I'm still
>> confused about the use of data cell.
> 
> As you are aware there are two address spaces for the locks.  The 'virtual' one
> refers to cells in the logical address space of the thin devices, and the 'data' one
> refers to the underlying data device.  There are certain conditions where we
> unfortunately need to hold both of these (eg, to prevent a data block being reprovisioned
> before an io to it has completed).
> 
>> The second question is the impact of virtual cell and data cell on IO
>> performance. If $data_block_size is large for example 1G, in multithread fio
>> test, most bio will be buffered in cell->bios list and then be processed by
>> worker thread asynchronously, even when there's no discard bio. Thus the
>> original parallel IO is processed by worker thread serially now. As the
>> number of fio test threads increase, the single worker thread can easily get
>> CPU 100%, and thus become the bottleneck of the performance since dm-thin
>> workqueue is ordered unbound.
> 
> Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
> new interface that we need to move dm-thin across to use (dm-cache already uses it).
> It allows concurrent holders of a cell (ie, read locks), so we'll be able to remap
> much more io without handing it off to a worker thread.  Once this is done I want
> to add an extra field to cells that will cache the mapping, this way if you acquire a
> cell that is already held then you can avoid the expensive btree lookup.  Together
> these changes should make a huge difference to the performance.
> 
> If you've got some spare coding cycles I'd love some help with this ;)
> 

Hi Joe,

I would be interested in helping you with this task. I can't make any
promises, but I believe I could probably spare some time to work on it.

If you think you could use the extra help, let me know.

Nikos

> - Joe
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-06 14:15   ` Nikos Tsironis
@ 2019-12-13  1:32     ` Eric Wheeler
  2019-12-15 21:44       ` Eric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wheeler @ 2019-12-13  1:32 UTC (permalink / raw)
  To: Nikos Tsironis; +Cc: JeffleXu, dm-devel, thornber

On Fri, 6 Dec 2019, Nikos Tsironis wrote:
> On 11/22/19 8:55 PM, Joe Thornber wrote:
> > On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:
> > 
> > > The first question is what's the purpose of data cell? In thin_bio_map(),
> > > normal bio will be packed as a virtual cell and data cell. I can
> > > understand
> > > that virtual cell is used to prevent discard bio and non-discard bio
> > > targeting the same block from being processed at the same time. I find it
> > > was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
> > > fix race between simultaneous io and discards to same block), but I'm
> > > still
> > > confused about the use of data cell.
> > 
> > As you are aware there are two address spaces for the locks.  The 'virtual'
> > one
> > refers to cells in the logical address space of the thin devices, and the
> > 'data' one
> > refers to the underlying data device.  There are certain conditions where we
> > unfortunately need to hold both of these (eg, to prevent a data block being
> > reprovisioned
> > before an io to it has completed).
> > 
> > > The second question is the impact of virtual cell and data cell on IO
> > > performance. If $data_block_size is large for example 1G, in multithread
> > > fio
> > > test, most bio will be buffered in cell->bios list and then be processed
> > > by
> > > worker thread asynchronously, even when there's no discard bio. Thus the
> > > original parallel IO is processed by worker thread serially now. As the
> > > number of fio test threads increase, the single worker thread can easily
> > > get
> > > CPU 100%, and thus become the bottleneck of the performance since dm-thin
> > > workqueue is ordered unbound.
> > 
> > Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
> > new interface that we need to move dm-thin across to use (dm-cache already
> > uses it).
> > It allows concurrent holders of a cell (ie, read locks), so we'll be able to
> > remap
> > much more io without handing it off to a worker thread.  Once this is done I
> > want
> > to add an extra field to cells that will cache the mapping, this way if you
> > acquire a
> > cell that is already held then you can avoid the expensive btree lookup.
> > Together
> > these changes should make a huge difference to the performance.
> > 
> > If you've got some spare coding cycles I'd love some help with this ;)
> > 
> 
> Hi Joe,
> 
> I would be interested in helping you with this task. I can't make any
> promises, but I believe I could probably spare some time to work on it.


Hi Nikos, it would be great if you are able to help with the 
dm-thin port to dm-bio-prison-v2.  I'm glad to see you are interested in 
dm-thin performance too.

These are the commits that implemented dm-bio-prison-v2 in dm-cache back 
in ~4.12; maybe they can give you a good start on what the conversion might 
look like:

b29d4986d dm cache: significant rework to leverage dm-bio-prison-v2

Here's a related bugfix:

d1260e2a3 dm cache: fix race condition in the writeback mode overwrite_bio optimisation



--
Eric Wheeler


> 
> Nikos
> 
> > - Joe
> > 
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel
> > 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-13  1:32     ` Eric Wheeler
@ 2019-12-15 21:44       ` Eric Wheeler
  2019-12-18 13:13         ` Joe Thornber
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wheeler @ 2019-12-15 21:44 UTC (permalink / raw)
  To: thornber; +Cc: JeffleXu, dm-devel, Nikos Tsironis

On Fri, 13 Dec 2019, Eric Wheeler wrote:
> On Fri, 6 Dec 2019, Nikos Tsironis wrote:
> > On 11/22/19 8:55 PM, Joe Thornber wrote:
> > > On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:
> > > 
> > > > The first question is what's the purpose of data cell? In thin_bio_map(),
> > > > normal bio will be packed as a virtual cell and data cell. I can
> > > > understand
> > > > that virtual cell is used to prevent discard bio and non-discard bio
> > > > targeting the same block from being processed at the same time. I find it
> > > > was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
> > > > fix race between simultaneous io and discards to same block), but I'm
> > > > still
> > > > confused about the use of data cell.
> > > 
> > > As you are aware there are two address spaces for the locks.  The 'virtual'
> > > one
> > > refers to cells in the logical address space of the thin devices, and the
> > > 'data' one
> > > refers to the underlying data device.  There are certain conditions where we
> > > unfortunately need to hold both of these (eg, to prevent a data block being
> > > reprovisioned
> > > before an io to it has completed).
> > > 
> > > > The second question is the impact of virtual cell and data cell on IO
> > > > performance. If $data_block_size is large for example 1G, in multithread
> > > > fio
> > > > test, most bio will be buffered in cell->bios list and then be processed
> > > > by
> > > > worker thread asynchronously, even when there's no discard bio. Thus the
> > > > original parallel IO is processed by worker thread serially now. As the
> > > > number of fio test threads increase, the single worker thread can easily
> > > > get
> > > > CPU 100%, and thus become the bottleneck of the performance since dm-thin
> > > > workqueue is ordered unbound.
> > > 
> > > Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h, this is the
> > > new interface that we need to move dm-thin across to use (dm-cache already
> > > uses it).
> > > It allows concurrent holders of a cell (ie, read locks), so we'll be able to
> > > remap
> > > much more io without handing it off to a worker thread.  Once this is done I
> > > want
> > > to add an extra field to cells that will cache the mapping, this way if you
> > > acquire a
> > > cell that is already held then you can avoid the expensive btree lookup.
> > > Together
> > > these changes should make a huge difference to the performance.
> > > 
> > > If you've got some spare coding cycles I'd love some help with this ;)
> > > 
> > 
> > Hi Joe,
> > 
> > I would be interested in helping you with this task. I can't make any
> > promises, but I believe I could probably spare some time to work on it.
> 
> 
> Hi Nikos, it would be great if you are able help with the 
> dm-thin port to dm-bio-prison-v2.  I'm glad to see you are interested in 
> dm-thin performance too.
> 
> These are the commits that implemented dm-bio-prison-v2 in dm-cache back 
> in ~4.12, maybe it can give you a good start on what the conversion might 
> look like:
> 
> b29d4986d dm cache: significant rework to leverage dm-bio-prison-v2
> 
> Here's a related bugfix:
> 
> d1260e2a3 dm cache: fix race condition in the writeback mode overwrite_bio optimisation

Hi Joe,

I was looking through the dm-bio-prison-v2 commit for dm-cache (b29d4986d) 
and it is huge, ~5k lines.  Do you still have a git branch with these 
commits in smaller pieces (not squashed) so we can find the bits that 
might be informative for converting lv-thin to use dm-bio-prison-v2?

For example, I think that, at least, the policy changes and 
btracker code are dm-cache specific and just a distraction when trying to 
understand the dm-bio-prison-v2 conversion.

--
Eric Wheeler


> 
> 
> 
> --
> Eric Wheeler
> 
> 
> > 
> > Nikos
> > 
> > > - Joe
> > > 
> > > --
> > > dm-devel mailing list
> > > dm-devel@redhat.com
> > > https://www.redhat.com/mailman/listinfo/dm-devel
> > > 
> > 
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel
> > 
> > 
> 
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-15 21:44       ` Eric Wheeler
@ 2019-12-18 13:13         ` Joe Thornber
  2019-12-18 20:27           ` Eric Wheeler
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Thornber @ 2019-12-18 13:13 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: JeffleXu, dm-devel, Nikos Tsironis

On Sun, Dec 15, 2019 at 09:44:49PM +0000, Eric Wheeler wrote:
> I was looking through the dm-bio-prison-v2 commit for dm-cache (b29d4986d) 
> and it is huge, ~5k lines.  Do you still have a git branch with these 
> commits in smaller pieces (not squashed) so we can find the bits that 
> might be informative for converting lv-thin to use dm-bio-prison-v2?
> 
> For example, I think that, at least, the policy changes and 
> btracker code is dm-cache specific and just a distraction when trying to 
> understand the dm-bio-prison-v2 conversion.

To be honest I would hold off for a couple of months.  I've been working
on the design for thinp 2 and have got to the point where I need to write
a userland proof of concept implementation.  In particular I've focussed on
packing more into btree nodes, and separating transactions so IO to different
thins has no locking contention.  The proof of concept will tell me just how
small I can get the metadata.  If the level of metadata compression is ~1/10th
we'll plug the new btrees into the existing design and switch to bio prison v2.
If it's greater, say 1/50th, then I'll rewrite the whole target to
use write-ahead logging for transactionality and ditch all metadata sharing altogether.
When the metadata is that small we can copy entire btrees to implement snapshots.

- Joe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dm-thin: Several Questions on dm-thin performance.
  2019-12-18 13:13         ` Joe Thornber
@ 2019-12-18 20:27           ` Eric Wheeler
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Wheeler @ 2019-12-18 20:27 UTC (permalink / raw)
  To: Joe Thornber; +Cc: JeffleXu, dm-devel, Nikos Tsironis

On Wed, 18 Dec 2019, Joe Thornber wrote:

> On Sun, Dec 15, 2019 at 09:44:49PM +0000, Eric Wheeler wrote:
> > I was looking through the dm-bio-prison-v2 commit for dm-cache (b29d4986d) 
> > and it is huge, ~5k lines.  Do you still have a git branch with these 
> > commits in smaller pieces (not squashed) so we can find the bits that 
> > might be informative for converting lv-thin to use dm-bio-prison-v2?
> > 
> > For example, I think that, at least, the policy changes and 
> > btracker code is dm-cache specific and just a distraction when trying to 
> > understand the dm-bio-prison-v2 conversion.
> 
> To be honest I would hold off for a couple of months.  I've been working
> on the design for thinp 2 and have got to the point where I need to write
> a userland proof of concept implementation.  In particular I've focussed on
> packing more into btree nodes, and separating transactions so IO to different
> thins has no locking contention.  The proof of concept will tell me just how
> small I can get the metadata.  If the level of metadata compression is ~1/10th
> we'll plug the new btrees into the existing design and switch to bio prison v2.
> If it's greater, say 1/50th, then I'll rewrite the whole target to
> use write-ahead logging for transactionality and ditch all metadata sharing altogether.
> When the metadata is that small we can copy entire btrees to implement snapshots.


Sounds great, looking forward to it.  The thinp target has worked great 
for us over the years.  Packing metadata and reducing lock contention will 
make it even better.

--
Eric Wheeler



> 
> - Joe
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* dm-thin: Several Questions on dm-thin performance.
@ 2019-11-22  3:43 JeffleXu
  0 siblings, 0 replies; 13+ messages in thread
From: JeffleXu @ 2019-11-22  3:43 UTC (permalink / raw)
  To: agk, Mike Snitzer; +Cc: dm-devel

Hi guys,

I have several questions about dm-thin that came up while testing and 
evaluating its IO performance. I would be grateful if someone could spend a 
little time on them.


The first question is: what is the purpose of the data cell? In 
thin_bio_map(), a normal bio will be packed into both a virtual cell and a 
data cell. I can understand that the virtual cell is used to prevent a 
discard bio and a non-discard bio targeting the same block from being 
processed at the same time. I found that it was added in commit 
e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin: fix race between 
simultaneous io and discards to same block), but I'm still confused 
about the use of the data cell.


The second question is about the impact of the virtual cell and data cell on 
IO performance. If $data_block_size is large, for example 1G, then in a 
multithreaded fio test most bios will be buffered in the cell->bios list and 
then processed asynchronously by the worker thread, even when there is no 
discard bio. So IO that was originally parallel is now processed serially by 
the worker thread. As the number of fio test threads increases, the single 
worker thread can easily hit 100% CPU and become the performance bottleneck, 
since the dm-thin workqueue is an ordered (single-threaded) unbound workqueue.

Using an nvme SSD and fio (direct=1, ioengine=libaio, iodepth=128, 
numjobs=4, rw=read, bs=4k), the bandwidth on the bare nvme device is 
1589MiB/s. The bandwidth on the thin device is only 1274MiB/s, while the four 
fio threads run at 200% CPU and the single worker thread is always running 
at 100% CPU. perf of the worker thread shows that process_bio() consumes 86% 
of the time.


Besides, it seems that I can't send email to the dm-devel@redhat.com mailing 
list.


Regards

Jeffle Xu


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-12-18 20:27 UTC | newest]

Thread overview: 13+ messages
2019-11-22  3:14 dm-thin: Several Questions on dm-thin performance JeffleXu
2019-11-22 18:55 ` Joe Thornber
2019-12-02  7:50   ` JeffleXu
2019-12-02 22:26     ` Eric Wheeler
2019-12-03 12:51       ` Joe Thornber
2019-12-04 12:11         ` Joe Thornber
2019-12-05 23:14           ` Eric Wheeler
2019-12-06 14:15   ` Nikos Tsironis
2019-12-13  1:32     ` Eric Wheeler
2019-12-15 21:44       ` Eric Wheeler
2019-12-18 13:13         ` Joe Thornber
2019-12-18 20:27           ` Eric Wheeler
2019-11-22  3:43 JeffleXu
