* dm thin pool discarding
@ 2019-01-10  0:39 james harvey
  2019-01-10  1:09 ` james harvey
  2019-01-10  9:18 ` Zdenek Kabelac
  0 siblings, 2 replies; 11+ messages in thread
From: james harvey @ 2019-01-10  0:39 UTC (permalink / raw)
  To: dm-devel

I've been talking with ntfs-3g developers, and they're updating their
discard code to work when an NTFS volume is within an LVM thin volume.

It turns out their code was refusing to discard if discard_granularity
was greater than the NTFS cluster size.  By default, an LVM thick volume
reports a discard_granularity of 512 bytes, and the NTFS cluster size is
4096.  By default, an LVM thin volume reports a discard_granularity of
65536 bytes.
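(For reference, those numbers are straight from sysfs; the dm-X/dm-Y names
below are just placeholders for whichever dm devices back the thick and
thin LVs:)

$ cat /sys/block/dm-X/queue/discard_granularity    # thick LV
512
$ cat /sys/block/dm-Y/queue/discard_granularity    # thin LV, 64K chunk pool
65536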

For thin volumes, LVM seems to be returning a discard_granularity
equal to the thin pool's chunksize, which totally makes sense.

Q1 - Is it correct that a filesystem's discard code needs to look for
an entire block of size discard_granularity to send to the block
device (dm/LVM)?  That dm/LVM cannot accept discarding smaller amounts
than this?  (Seems to make sense to me, since otherwise I think the
metadata would need to keep track of smaller chunks than the
chunksize, and it doesn't have the metadata space to do that.)

Q2 - Is it correct that the blocks of size discard_granularity sent to
dm/LVM need to be aligned from the start of the volume, rather than
the start of the partition?  Let's say the thin pool chunk size is set
high, like 128MB, and the LVM volume is given to a Virtual Machine as
a raw disk, which creates a partition table within it.  The VM is
going to "properly align" the partitions.  Using fdisk 2.33 and gpt,
on a thin pool chunk size of 128MB, it shows sectors of 512 bytes, and
puts partition 1 starting at sector 2048, so at 1MB.  If the
filesystem merely considers alignment from the beginning of its
partition, that's not going to line up with the alignment of the
beginning of the block device, unless 1MB is a multiple of the thin
pool chunk size.
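A quick arithmetic sanity check of that worry (assuming 512-byte sectors
and the 128MB chunk size above; a result of 0 would mean the partition
start falls on a chunk boundary):

$ echo $(( 2048 * 512 ))                            # partition start, bytes
1048576
$ echo $(( (2048 * 512) % (128 * 1024 * 1024) ))    # offset into its chunk
1048576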

Q3 - Does an LVM thin volume zero out the bytes that are discarded?  At
least for me, queue/discard_zeroes_data is 0.  I see there was
discussion on the list about adding this back in 2012, but I'm not sure
a way to enable it was ever added.

Q4 - Are there dragons here?  If I'm right about how Q1 & Q2 need to be
handled, and a filesystem incorrectly sends a discard starting at a
location that isn't properly aligned, will LVM/dm reject the request,
or will it still perform some action?  I saw references to block
devices "rounding" discard requests, which sounds really scary to me,
as if a filesystem that gets this wrong could cause data
corruption/loss.  (I'm not talking about the filesystem going haywire
and discarding areas it should know are in use, but rather
misunderstanding the alignment issues.)


* Re: dm thin pool discarding
  2019-01-10  0:39 dm thin pool discarding james harvey
@ 2019-01-10  1:09 ` james harvey
  2019-01-10  9:18 ` Zdenek Kabelac
  1 sibling, 0 replies; 11+ messages in thread
From: james harvey @ 2019-01-10  1:09 UTC (permalink / raw)
  To: dm-devel

On Wed, Jan 9, 2019 at 7:39 PM james harvey <jamespharvey20@gmail.com> wrote:
> Q2 - Is it correct that the blocks of size discard_granularity sent to
> dm/LVM need to be aligned from the start of the volume, rather than
> the start of the partition?  ...

This is probably what discard_alignment is for.

# lvcreate --size 1G --chunksize 128MB --thin lvm/thinpoolntfs /dev/nvme0n1p4
# lvcreate --virtualsize 256M --thin lvm/thinpoolntfs --name ntfs
# fdisk /dev/lvm/ntfs
GPT
New, Partition number 1, First sector 2048 (default was 32 sectors of
512 bytes?), +128M
# kpartx -a /dev/lvm/ntfs
# dmsetup ls | grep ntfs
lvm-ntfs (254:16)
lvm-ntfs1 (254:17)
$ cat /sys/dev/block/254:16/discard_alignment
0 # makes sense, this is the volume itself
$ cat /sys/dev/block/254:17/discard_alignment
133169152

That's equal to the chunk size (128 * 1024 * 1024) minus the start
of the partition (2048 sectors * 512 bytes/sector).

discard_alignment is described at
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block

as: "Devices that support discard functionality may internally
allocate space in units that are bigger than the exported logical
block size. The discard_alignment parameter indicates how many bytes
the beginning of the device is offset from the internal allocation
unit's natural alignment."

So, is that value supposed to be how many bytes are LEFT in the first
discard granularity block?  Meaning, the fs should treat this many
bytes as non-discardable?  Instead of this value giving how many bytes
INTO the first discard granularity block the partition starts?
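To make the two readings concrete with the numbers from the session above
(128MB chunk, partition at sector 2048, 512-byte sectors):

$ echo $(( (2048 * 512) % (128 * 1024 * 1024) ))    # bytes INTO the chunk
1048576
$ echo $(( (128 * 1024 * 1024) - (2048 * 512) ))    # bytes LEFT until the next boundary
133169152

The reported 133169152 matches the second reading.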


* Re: dm thin pool discarding
  2019-01-10  0:39 dm thin pool discarding james harvey
  2019-01-10  1:09 ` james harvey
@ 2019-01-10  9:18 ` Zdenek Kabelac
  2019-01-10 11:40   ` Martin Wilck
  2019-01-15 22:55   ` james harvey
  1 sibling, 2 replies; 11+ messages in thread
From: Zdenek Kabelac @ 2019-01-10  9:18 UTC (permalink / raw)
  To: james harvey, dm-devel

Dne 10. 01. 19 v 1:39 james harvey napsal(a):
> I've been talking with ntfs-3g developers, and they're updating their
> discard code to work when an NTFS volume is within an LVM thin volume.
> 
> It turns out their code was refusing to discard if discard_granularity
> was > the NTFS cluster size.  By default, a LVM thick volume is giving
> a discard_granularity of 512 bytes, and the NTFS cluster size is 4096.
> By default, a LVM thin volume is giving a discard_granularity of 65536
> bytes.
> 
> For thin volumes, LVM seems to be returning a discard_granularity
> equal to the thin pool's chunksize, which totally makes sense.
> 
> Q1 - Is it correct that a filesystem's discard code needs to look for
> an entire block of size discard_granularity to send to the block
> device (dm/LVM)?  That dm/LVM cannot accept discarding smaller amounts
> than this?  (Seems to make sense to me, since otherwise I think the
> metadata would need to keep track of smaller chunks than the
> chunksize, and it doesn't have the metadata space to do that.)


You can always send a discard for a 512b sector - but it will not really do
anything useful for a thin-pool unless you discard a 'whole' chunk.

That's why it is always better to use 'fstrim' - which will always try
to discard the 'largest' regions.

There is nothing in the thin-pool itself that would track which sectors of a
chunk were trimmed - so if you trim a chunk sector by sector, the chunk will
still appear as allocated by the thin volume.  And obviously there is nothing
that would be 'clearing' such trimmed sectors individually.  So when you
trim 512b out of a thin volume, reading the same data location will still
return your old data.  Only after 'trimming' a whole chunk (on chunk
boundaries) will you get zeros.  It's worth noting that every thin LV is
composed of chunks - so for a trim to succeed, trimming happens only on
aligned chunks - i.e. with chunk_size == 64K, if you try to trim 64K from
position 32K - nothing happens....
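As a concrete sketch with chunk_size == 64K (the device path is just a
placeholder; blkdiscard takes byte offsets/lengths):

# blkdiscard --offset $(( 32 * 1024 )) --length $(( 64 * 1024 )) /dev/vg/thinlv
  (straddles two chunks, neither fully covered - nothing is unmapped)
# blkdiscard --offset $(( 64 * 1024 )) --length $(( 64 * 1024 )) /dev/vg/thinlv
  (covers exactly one aligned chunk - that chunk is unmapped and reads back
  as zeros)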

> Q3 - Does a LVM thin volume zero out the bytes that are discarded?  At
> least for me, queue/discard_zeroes_data is 0.  I see there was
> discussion on the list of adding this back in 2012, but I'm not sure
> it was ever added for there to be a way to enable it.

Unprovisioned chunks always appear as zeroed for reading.
Once a chunk is provisioned (by a write) for a thin volume out of the
thin-pool, it depends on the thin-pool target setting 'skip_zeroing'.

So if zeroing is enabled (no skipping) and you use larger chunks, the
initial chunk provisioning becomes quite expensive - that's why lvm2 by
default recommends not using zeroing for chunk sizes > 512K.

When zeroing is disabled (skipped), provisioning is 'fast' - but whatever
content was 'left' on the thin-pool data device will be readable from
unwritten portions of provisioned chunks.  So you need to pick whether you
care or not.  Note - modern filesystems track 'written' data, so a normal
user can never see such data by reading files from the filesystem - but of
course root with the 'dd' command can examine any portion of such a device.
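You can see which mode a pool runs in from its dm table - the numbers below
are just an illustrative 1G pool with 128MB chunks (262144 sectors), and
'skip_block_zeroing' in the trailing feature args means zeroing is skipped:

# dmsetup table lvm-thinpoolntfs-tpool
0 2097152 thin-pool 254:4 254:5 262144 0 1 skip_block_zeroing

lvm2 controls this with the -Z/--zero option on the pool (lvcreate --zero
y|n, lvchange --zero y|n).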

I hope this makes it clear.

Zdenek


* Re: dm thin pool discarding
  2019-01-10  9:18 ` Zdenek Kabelac
@ 2019-01-10 11:40   ` Martin Wilck
  2019-01-10 11:52     ` Zdenek Kabelac
  2019-01-15 22:55   ` james harvey
  1 sibling, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2019-01-10 11:40 UTC (permalink / raw)
  To: Zdenek Kabelac, james harvey, dm-devel

On Thu, 2019-01-10 at 10:18 +0100, Zdenek Kabelac wrote:
> Dne 10. 01. 19 v 1:39 james harvey napsal(a):
> > 
> > Q3 - Does a LVM thin volume zero out the bytes that are
> > discarded?  At
> > least for me, queue/discard_zeroes_data is 0.  I see there was
> > discussion on the list of adding this back in 2012, but I'm not
> > sure
> > it was ever added for there to be a way to enable it.
> 
> Unprovisioned chunks always appear as zeroed for reading.
> Once you provision chunk (by write) for thin volume out of thin-pool
> - it 
> depends on thin-pool target setting 'skip_zeroing'.
> 
> So if zeroing is enabled (no skipping) - and you use larger chunks -
> the 
> initial chunk provisioning becomes quite expensive - that's why lvm2
> is by 
> default recommending to not use zeroing for chunk sizes > 512K.

Which begs the question why lvm zeroes at provisioning time, and not at
discard time, where speed matters less (and the operation could be
carried out lazily, taking care only that the discarded blocks aren't
re-provisioned before they are zeroed).

So far my understanding was that even without zeroing, an LVM thin
volume could be considered as a drive with "discard zeroes data"
property. If there's a flaw in the argument below, please point it out
to me.

Firstly, IMO "discard" is not "secure erase". Considering an SSD, the
"discards zeroes data" property doesn't make sure that the data is
unrecoverably wiped. It just means that future attempts to read the
discarded sectors return zeroes. The data may well persist in flash
memory, and be readable to attackers with suitable tools.

Now consider a VM that uses a dm-thin volume as storage. If this VM
issues a discard operation on some chunk of data, future reads on the
discarded chunks through the same LV will return 0 because these chunks
have just become unprovisioned. That looks pretty much like "discard
zeroes data" to me. Right? Whether that data might become visible to
another VM using another thin volume is a different question, more
along the "secure erase" line of thought. The blocks in the thin pool
outside the used thin LV are a bit like the "spare area" of an SSD, at
least from the point of view of a VM.

The point I'm uncertain about is what happens if such a chunk is
(re)provisioned by a partial write (say chunk size is 1M and only 512k
is written). What data would dm-thin return from a read of the non-
overwritten part of that chunk?

Thanks,
Martin


* Re: dm thin pool discarding
  2019-01-10 11:40   ` Martin Wilck
@ 2019-01-10 11:52     ` Zdenek Kabelac
  2019-01-10 13:41       ` Martin Wilck
  0 siblings, 1 reply; 11+ messages in thread
From: Zdenek Kabelac @ 2019-01-10 11:52 UTC (permalink / raw)
  To: Martin Wilck, james harvey, dm-devel

Dne 10. 01. 19 v 12:40 Martin Wilck napsal(a):
> On Thu, 2019-01-10 at 10:18 +0100, Zdenek Kabelac wrote:
>> Dne 10. 01. 19 v 1:39 james harvey napsal(a):
>>>
>>> Q3 - Does a LVM thin volume zero out the bytes that are
>>> discarded?  At
>>> least for me, queue/discard_zeroes_data is 0.  I see there was
>>> discussion on the list of adding this back in 2012, but I'm not
>>> sure
>>> it was ever added for there to be a way to enable it.
>>
>> Unprovisioned chunks always appear as zeroed for reading.
>> Once you provision chunk (by write) for thin volume out of thin-pool
>> - it
>> depends on thin-pool target setting 'skip_zeroing'.
>>
>> So if zeroing is enabled (no skipping) - and you use larger chunks -
>> the
>> initial chunk provisioning becomes quite expensive - that's why lvm2
>> is by
>> default recommending to not use zeroing for chunk sizes > 512K.
> 
> Which begs the question why lvm zeroes at provisioning time, and not at
> discard time, where speed matters less (and the operation could be
> carried out lazily, taking care only that the discarded blocks aren't
> re-provisioned before they are zeroed).

There are a few simple answers to this.

If 'zeroing' happens at the moment of provisioning, then with 'small chunks'
like 64K or 128K there is in many cases actually no zeroing at all, as the
chunk is fully written at provisioning time.

So in most cases there is no associated 'extra cost'.

Of course if chunks are big, this no longer applies, and extra time is
wasted while zeroes go to 'unwritten' sectors.


> So far my understanding was that even without zeroing, an LVM thin
> volume could be considered as a drive with "discard zeroes data"
> property. If there's a flaw in the argument below, please point it out
> to me.
>

As said - if you discard less than an aligned chunk - nothing happens,
so it cannot be taken as if it were always zeroing...


> Now consider a VM that uses a dm-thin volume as storage. If this VM
> issues a discard operation on some chunk of data, future reads on the
> discarded chunks through the same LV will return 0 because these chunks
> have just become unprovisioned. That looks pretty much like "disard

The thin-pool's current data structures make some operations 'cheap' and
some others quite expensive.

I.e. you could implement some sort of 'offline' zeroing where the chunks
that are left unused in the thin-pool are 'pre-zeroed' when the thin-pool
has spare bandwidth - but the real benefit is 'questionable'.  As already
mentioned, with smaller chunk sizes there are typically no big extra
costs....  It might have some effect with big chunks though - but those are
on the other hand very inefficient with snapshots - so it usually does not
apply to VM users....


> The point I'm uncertain about is what happens if such a chunk is
> (re)provisioned by a partial write (say chunk size is 1M and only 512k
> is written). What data would dm-thin return from a read of the non-
> overwritten part of that chunk?

Clearly, unwritten portions of the chunk are filled with zeroes.


Zdenek


* Re: dm thin pool discarding
  2019-01-10 11:52     ` Zdenek Kabelac
@ 2019-01-10 13:41       ` Martin Wilck
  2019-01-10 15:08         ` Zdenek Kabelac
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2019-01-10 13:41 UTC (permalink / raw)
  To: Zdenek Kabelac, james harvey, dm-devel

On Thu, 2019-01-10 at 12:52 +0100, Zdenek Kabelac wrote:
> Dne 10. 01. 19 v 12:40 Martin Wilck napsal(a):
> > 
> > So far my understanding was that even without zeroing, an LVM thin
> > volume could be considered as a drive with "discard zeroes data"
> > property. If there's a flaw in the argument below, please point it
> > out
> > to me.
> > 
> 
> As said - if you discard 'less then aligned' chunk - nothing happens,
> so it cannot be takes as like it would be always zeroing...

Yuck, that's the Catch-22 then. Sorry for having missed that on the
first pass. Wouldn't it be wise to fail discard (or only zeroout?)
requests which aren't chunk-aligned, rather than just doing nothing? 

I for one would find it very attractive if dm-thin had a mode
supporting fast zeroout.

Martin


* Re: dm thin pool discarding
  2019-01-10 13:41       ` Martin Wilck
@ 2019-01-10 15:08         ` Zdenek Kabelac
  2019-01-10 15:23           ` Martin Wilck
  0 siblings, 1 reply; 11+ messages in thread
From: Zdenek Kabelac @ 2019-01-10 15:08 UTC (permalink / raw)
  To: Martin Wilck, james harvey, dm-devel

Dne 10. 01. 19 v 14:41 Martin Wilck napsal(a):
> On Thu, 2019-01-10 at 12:52 +0100, Zdenek Kabelac wrote:
>> Dne 10. 01. 19 v 12:40 Martin Wilck napsal(a):
>>>
>>> So far my understanding was that even without zeroing, an LVM thin
>>> volume could be considered as a drive with "discard zeroes data"
>>> property. If there's a flaw in the argument below, please point it
>>> out
>>> to me.
>>>
>>
>> As said - if you discard 'less then aligned' chunk - nothing happens,
>> so it cannot be takes as like it would be always zeroing...
> 
> Yuck, that's the Catch-22 then. Sorry for having missed that on the
> first pass. Wouldn't it be wise to fail discard (or only zeroout?)
> requests which aren't chunk-aligned, rather than just doing nothing?
> 
> I for one would find it very attractive if dm-thin had a mode
> supporting fast zeroout.


I believe '/sys/block/*/queue/discard_zeroes_data' is now always returning
false for any device, as it has been considered unreliable logic.

So using 'discard' for zeroing is not really an option here.

Zdenek


* Re: dm thin pool discarding
  2019-01-10 15:08         ` Zdenek Kabelac
@ 2019-01-10 15:23           ` Martin Wilck
  2019-01-10 15:55             ` Zdenek Kabelac
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Wilck @ 2019-01-10 15:23 UTC (permalink / raw)
  To: Zdenek Kabelac, james harvey, dm-devel

On Thu, 2019-01-10 at 16:08 +0100, Zdenek Kabelac wrote:
> Dne 10. 01. 19 v 14:41 Martin Wilck napsal(a):
> > 
> > I for one would find it very attractive if dm-thin had a mode
> > supporting fast zeroout.
> 
> I believe '/sys/block/*/queue/discard_zeroes_data  is now always
> returning 
> false for any device as it's been considered as unreliable logic.
> 
> So using 'discard' for zeroing is not really an option here.

True. But we have "write_zeroes_max_bytes" instead. In device mapper
terms, it's "num_write_zeroes_bios", which isn't set for dm-thin. With
the logic outlined above, it could be set, AFAICT.
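(For anyone checking their own setup - dm-16 below stands for whatever
minor your thin LV got - the stacked limit is visible in sysfs, and 0 means
no efficient write-zeroes path is advertised:)

$ cat /sys/block/dm-16/queue/write_zeroes_max_bytes
0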

Thanks,
Martin


* Re: dm thin pool discarding
  2019-01-10 15:23           ` Martin Wilck
@ 2019-01-10 15:55             ` Zdenek Kabelac
  2019-01-10 16:12               ` Martin Wilck
  0 siblings, 1 reply; 11+ messages in thread
From: Zdenek Kabelac @ 2019-01-10 15:55 UTC (permalink / raw)
  To: Martin Wilck, james harvey, dm-devel

Dne 10. 01. 19 v 16:23 Martin Wilck napsal(a):
> On Thu, 2019-01-10 at 16:08 +0100, Zdenek Kabelac wrote:
>> Dne 10. 01. 19 v 14:41 Martin Wilck napsal(a):
>>>
>>> I for one would find it very attractive if dm-thin had a mode
>>> supporting fast zeroout.
>>
>> I believe '/sys/block/*/queue/discard_zeroes_data  is now always
>> returning
>> false for any device as it's been considered as unreliable logic.
>>
>> So using 'discard' for zeroing is not really an option here.
> 
> True. But we have "write_zeroes_max_bytes" instead. In device mapper
> terms, it's "num_write_zeroes_bios", which isn't set for dm-thin. With
> the logic outlined above, it could be set, AFAICT.


I assume this is a valid request for enhancement of the thin-pool target.
However, this 'WRITE_SAME' - or whatever the scsi low-level command is - is
basically targeted at the thinLV user.

So in this case the most effective 'zeroing' of thinLV areas is its discard.

It'd be interface abuse to use it for intentional slow physical 'zeroing' of
allocated chunks before they are 'returned' to the pool.

On the other hand, I can imagine some sort of 'new' flag that would always
zero chunks that are returned to the thin-pool - or it could be a message
sent before the final release of a thinLV.

So in that case you would send a dm message 'SECURE_ERASE' (or whatever
common interface there is for that) before the actual lvremove call, and the
thin-pool would zero out all exclusively provisioned blocks that would be
returned back to the pool.

I guess you can formulate something like this as an RFE BZ.

Also note - you can do this operation yourself already today using the
thin tools and some scripting-fu skills - but it's pretty hacky....


Zdenek


* Re: dm thin pool discarding
  2019-01-10 15:55             ` Zdenek Kabelac
@ 2019-01-10 16:12               ` Martin Wilck
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Wilck @ 2019-01-10 16:12 UTC (permalink / raw)
  To: Zdenek Kabelac, james harvey, dm-devel

On Thu, 2019-01-10 at 16:55 +0100, Zdenek Kabelac wrote:
> Dne 10. 01. 19 v 16:23 Martin Wilck napsal(a):
> > On Thu, 2019-01-10 at 16:08 +0100, Zdenek Kabelac wrote:
> > > Dne 10. 01. 19 v 14:41 Martin Wilck napsal(a):
> > > > I for one would find it very attractive if dm-thin had a mode
> > > > supporting fast zeroout.
> > > 
> > > I believe '/sys/block/*/queue/discard_zeroes_data  is now always
> > > returning
> > > false for any device as it's been considered as unreliable logic.
> > > 
> > > So using 'discard' for zeroing is not really an option here.
> > 
> > True. But we have "write_zeroes_max_bytes" instead. In device
> > mapper
> > terms, it's "num_write_zeroes_bios", which isn't set for dm-thin.
> > With
> > the logic outlined above, it could be set, AFAICT.
> 
> I assume this is valid request for enhancement of thin-pool target.
> However  this  'WRITE_SAME'  or whatever scsi low-level command is
> that is 
> basically targeted for thinLV user.
> 
> So in this case the most effective 'zeroing' of thinLV areas is its
> discard.

That's what I have in mind. Basically, one could implement a
"write_zeroes" operation in a similar manner as discard is implemented
now. We'd just need to be careful with the case where a write_zeroes
request isn't aligned to chunk size (aka discard_granularity), and
instead of a noop, either return an error, or fall back to "slow"
zeroing for the unaligned pieces. I believe that wouldn't be a big
problem in practice - prudent users would respect the granularity
anyway.
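(As far as I can tell, a zeroout request against a thin LV today, e.g.

# blkdiscard --zeroout /dev/vg/thinlv

just falls back to the generic path that submits real zero-filled writes,
provisioning every chunk it touches, since dm-thin advertises no
write-zeroes support - which is exactly the slow path we'd want to avoid.)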

Regards
Martin


* Re: dm thin pool discarding
  2019-01-10  9:18 ` Zdenek Kabelac
  2019-01-10 11:40   ` Martin Wilck
@ 2019-01-15 22:55   ` james harvey
  1 sibling, 0 replies; 11+ messages in thread
From: james harvey @ 2019-01-15 22:55 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: dm-devel

On Thu, Jan 10, 2019 at 4:18 AM Zdenek Kabelac <zkabelac@redhat.com> wrote:
>
> Dne 10. 01. 19 v 1:39 james harvey napsal(a):
> > Q1 - Is it correct that a filesystem's discard code needs to look for
> > an entire block of size discard_granularity to send to the block
> > device (dm/LVM)?  ...
>
> ... Only after 'trimming' whole chunk (on chunk
> boundaries) - you will get zero.  It's worth to note that every thin LV is
> composed from chunks - so to have successful trim - trimming happens only on
> aligned chunks - i.e. chunk_size == 64K and then if you try to trim 64K from
> position 32K - nothing happens....

If chunk_size == 64K, and you try to trim 96K from position 32K, with
bad alignment, would the last 64K get trimmed?

> I hope this makes it clear.
>
> Zdenek

Definitely, thanks!


If an LVM thin volume has a partition within it, which is not aligned
with discard_granularity, and that partition is exposed using kpartx,
I'm pretty sure LVM/dm/kpartx is computing discard_alignment
incorrectly.

It's defined here:
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-block ---
as: "Devices that support discard functionality may internally
allocate space in units that are bigger than the exported logical
block size. The discard_alignment parameter indicates how many bytes
the beginning of the device is offset from the internal allocation
unit's natural alignment."

I emailed the linux-kernel list, also sending to Martin Petersen,
listed as the contact for the sysfs entry.  See
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1895560.html
-- He replied, including:

"The common alignment scenario is 3584 on a device with 4K physical
blocks. That's because of the 63-sector legacy FAT partition table
offset. Which essentially means that the first LBA is misaligned and
the first aligned [L]BA is 7."

So, there, I think he's saying given:
* A device of 4K physical blocks
* The first partition being at sector 63 (512 bytes each)

Then discard_alignment should be 63*512 mod 4096, which is 3584.
Meaning, the offset from the beginning of the allocation unit that
holds the beginning of the block device (here, a partition), to the
beginning of the block device.


But, LVM/dm/kpartx seems to be calculating it in reverse, instead
giving the offset from where the block device (partition) starts to
the beginning of the NEXT allocation unit.  Given:
* An LVM thin volume with chunk_size 128MB
* The first partition being at sector 2048 (512 bytes each)

I would expect discard_alignment to be 1MB (2048 sectors * 512
bytes/sector.)  But, LVM/dm/kpartx is giving 127MB (128MB chunk_size -
2048 sectors * 512 bytes/sector.)
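Putting numbers on both readings (the 3584 case is Martin Petersen's
example above; the second line applies the same formula to the kpartx'd
partition here):

$ echo $(( (63 * 512) % 4096 ))
3584
$ echo $(( (2048 * 512) % (128 * 1024 * 1024) ))
1048576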


I don't know how important this is.  If I understand all of this
correctly, I think it just potentially reduces how many areas are
trimmed.


I ran across this using small values, while figuring out why ntfs-3g
wasn't discarding when on an LVM thin volume.  Putting a partition
within the LVM thin volume is meant to be a stand-in for giving it to
a VM which would have its own partition table.

It appears fdisk typically forces a partition's first sector to be at
a minimum of the chunk_size.  Without looking at the code, I'm
guessing it's using the optimal I/O size.  But, since I was using
really small values in my test, I think I found that at some point
fdisk starts allowing the partition's first sector to be much earlier,
since in my scenario it would otherwise be starting the partition
halfway through the disk.  In the example below it allows a starting
sector of 34 (and the user chooses 2048 to at least 1MB-align); with a
larger volume, it allows a starting sector of 262144 (= the 128MB
chunk size).

But, this is probably reproduced much more commonly in real applications
by giving the LVM thin volume to a VM, then later using it in the host
through kpartx.  At least in the case of QEMU, within the guest OS,
discard_alignment is 0, even if within the host it has a different
value.  Reported to QEMU here:
https://bugs.launchpad.net/qemu/+bug/1811543 -- So, within the guest,
fdisk is going to immediately allow the first partition to begin at
sector 2048.


How to reproduce this on one system, without VM's involved:

# pvcreate /dev/sdd1
  Physical volume "/dev/sdd1" successfully created.
# pvs | grep sdd1
  /dev/sdd1          lvm2 ---  <100.00g <100.00g
# vgextend lvm /dev/sdd1
  Volume group "lvm" successfully extended
# lvcreate --size 1g --chunksize 128M --zero n --thin lvm/tmpthinpool /dev/sdd1
  Thin pool volume with chunk size 128.00 MiB can address at most
31.62 PiB of data.
  Logical volume "tmpthinpool" created.
# lvcreate --virtualsize 256M --thin lvm/tmpthinpool --name tmp
  Logical volume "tmp" created.
# fdisk /dev/lvm/tmp
...
Command (m for help): g
Created a new GPT disklabel (GUID: 7D31AE50-32AA-BC47-9D7B-CFD6497D520B).

Command (m for help): n
Partition number (1-128, default 1):
First sector (34-524254, default 40): 2048  **** This is what allows
this problem ****
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-524254, default 524254):

Created a new partition 1 of type 'Linux filesystem' and of size 255 MiB.

Command (m for help): p
Disk /dev/lvm/tmp: 256 MiB, 268435456 bytes, 524288 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 134217728 bytes

# kpartx -a /dev/lvm/tmp
# dmsetup ls | grep tmp
lvm-tmp (254:13)
lvm-tmp1        (254:14)
lvm-tmpthinpool-tpool   (254:8)
lvm-tmpthinpool_tdata   (254:7)
lvm-tmpthinpool_tmeta   (254:6)
lvm-tmpthinpool (254:9)
$ cat /sys/dev/block/254:13/discard_alignment
0
(All good, on the LV itself)
$ cat /sys/dev/block/254:14/discard_alignment
133169152

That's the value that I think is wrong.  It's reporting the chunk size
minus the location of the partition, i.e. 128*1024*1024 - 512 bytes/sector *
2048 sectors.

I think it should be 1048576 (512 bytes/sector * 2048 sectors).


