* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
       [not found] <1872684910.4114972.1463547443287.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-18  4:57 ` matthew patton
  2016-05-18 14:20   ` Xen
  0 siblings, 1 reply; 23+ messages in thread
From: matthew patton @ 2016-05-18  4:57 UTC (permalink / raw)
  To: LVM general discussion and development

Xen wrote:

<quote> So there are two different cases as mentioned: existing block writes,
 and new block writes. What I was gabbing about earlier would be forcing
 a filesystem to also be able to distinguish between them. You would
 have a filesystem-level "no extend" mode or "no allocate" mode that gets
 triggered. Initially my thought was to have this get triggered through
 the FS-LVM interface. But it could also be made operational not through
 any membrane but simply by having a kernel (module) that gets passed
 this information. In both cases the idea is to say: the filesystem can
 do what it wants with existing blocks, but it cannot get new ones.
</quote>

You still have no earthly clue how the various layers work, apparently. For the FS to "know" which of its blocks can be scribbled on and which can't means it has to constantly poll the block layer (the next layer down may NOT necessarily be LVM) on every write. Goodbye performance.

<quote>
 However, it does mean the filesystem must know the 'hidden geometry' 
 beneath its own blocks, so that it can know about stuff that won't work 
 anymore.
</quote>

I'm pretty sure this was explained to you a couple weeks ago: it's called "integration". For 50 years filesystems were DELIBERATELY written to be agnostic if not outright ignorant of the underlying block device's peculiarities. That's how modular software is written. Sure, some optimizations have been made by peeking into attributes exposed by the block layer, but those attributes don't change over time. They are probed at newfs() time and never consulted again.

Chafing at the inherent tradeoffs caused by "lack of knowledge" was why BTRFS and ZFS were written. It is ignorant to keep pounding the "but I want XFS/EXT+LVM to have feature parity with BTRFS" drum. It's not supposed to, it was never intended, and it will never happen. So go use the tool as it's designed or go use something else that tickles your fancy.
 
<quote>
 Will mention that I still haven't tested --errorwhenfull yet.
</quote>

But you conveniently overlook the fact that the FS is NOT remotely full using any of the standard tools - all of a sudden the FS got signaled that the block layer was denying write BIO calls. Maybe there's a helpful kern.err in syslog that you wrote support for? 

<quote>
 In principle if you had the means to acquire such a flag/state/condition, and the
 filesystem would be able to block new allocation wherever, whenever, you would
 already have a working system. So what is then non-trivial?
...
 It seems completely obvious to me at this point that if anything from
 LVM (or e.g. dmeventd) could signal every filesystem on every affected
 thin volume to enter a do-not-allocate state, and filesystems would be
 able to fail writes based on that, you would already have a solution
</quote>

And so therefore, in order to acquire this "signal", every write has to be done in synchronous fashion, with strict data integrity maintained vis-a-vis filesystem data and metadata. Kernel dirty block sizes and flush intervals are knobs that can be turned to "signal" user-land that write errors are happening. There's no such thing as "immediate" unless you use synchronous function calls from userland.
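The knobs being referred to are presumably the vm.dirty_* sysctls - a sketch with illustrative values (smaller dirty limits and shorter flush intervals make write errors surface to userland sooner):

sysctl -w vm.dirty_bytes=8388608                # cap buffered dirty data at 8MB
sysctl -w vm.dirty_background_bytes=4194304     # start background writeback at 4MB
sysctl -w vm.dirty_expire_centisecs=500         # consider dirty data "old" after 5s
sysctl -w vm.dirty_writeback_centisecs=100      # run the flusher every 1s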

If you want to write your application to handle "mis-behaved" block layers, then use O_DIRECT+O_SYNC.
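A minimal sketch of what that means in practice (device path is an example, error handling trimmed): O_DIRECT bypasses the page cache and O_SYNC makes each write durable before returning, so a thin-pool allocation failure comes back as -1/errno on this very write() instead of minutes later at flush time.

#define _GNU_SOURCE            /* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("/dev/vg00/thinlv", O_WRONLY | O_DIRECT | O_SYNC);
    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT wants aligned I/O */
        return 1;
    memset(buf, 0xab, 4096);
    if (write(fd, buf, 4096) != 4096)
        return 1;   /* e.g. errno == ENOSPC/EIO once the pool is exhausted */
    free(buf);
    close(fd);
    return 0;
}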


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-18  4:57 ` [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed? matthew patton
@ 2016-05-18 14:20   ` Xen
  0 siblings, 0 replies; 23+ messages in thread
From: Xen @ 2016-05-18 14:20 UTC (permalink / raw)
  To: LVM general discussion and development

matthew patton wrote on 18-05-2016 6:57:


Just want to say your belligerent emails are ending up in the trash can. 
Not automatically, but after scanning, mostly.

At the same time, perhaps it is worth noting that while all other
emails from this list end up in my main mailbox just fine, yours (and
yours alone) trigger the built-in spam filter of my email provider,
even though I have never trained it to treat your emails as spam.

Basically, each and every time I will find your messages in my spam box. 
Makes you think, eh? But then, just for good measure, let me just 
concisely respond to this one:


> For the FS to "know" which of its blocks can be scribbled
> on and which can't means it has to constantly poll the block layer
> (the next layer down may NOT necessarily be LVM) on every write.
> Goodbye performance.

Simply false, and I explained this already: the filesystem is already
being optimized for alignment with (possible) "thin" blocks (Zdenek has
mentioned this) in order to allocate (cause allocation) more
efficiently on the underlying layer. If it already has knowledge of
this alignment, and it has knowledge of its own block usage - meaning
it can easily discover which of the "alignment" blocks it has already
written to itself - then it has all the data and all the knowledge it
needs to know which blocks (extents) are completely "free".

Suppose you had a 4KB blockmap (bitmap).

Now suppose you have 4MB extents.

Then every 1024 bits in the blockmap correspond to one bit in the
extent map (4MB / 4KB = 1024 blocks per extent). You know this.

To condense the free blockmap into a free extent map:

(bit "0" is free, bit "1" is in use):

For every extent:

blockmap_segment = blockmap & (ONES_1024 << (extent_number * 1024));
is_an_empty_extent = (blockmap_segment == 0);

(where ONES_1024 is a mask of 1024 set bits, one per block in the
extent - the extent is empty when none of its blocks is in use)

So it knows clearly which extents are empty.

Then it can simply be told not to write to those extents anymore.
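In C, a minimal sketch of that condensation (sizes, packing and names
are illustrative assumptions, not anything an existing filesystem does):

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE        4096U            /* filesystem block: 4 KiB */
#define EXTENT_SIZE       (4096U * 1024U)  /* thin chunk/extent: 4 MiB */
#define BLOCKS_PER_EXTENT (EXTENT_SIZE / BLOCK_SIZE)   /* = 1024 */

/* blockmap holds one bit per filesystem block (1 = in use), packed into
 * 64-bit words.  An extent is "empty" - i.e. writing to it would force
 * the thin pool to allocate a chunk - only if none of its blocks is in
 * use. */
static bool extent_is_empty(const uint64_t *blockmap, unsigned extent)
{
    unsigned first_word = extent * (BLOCKS_PER_EXTENT / 64);
    for (unsigned w = 0; w < BLOCKS_PER_EXTENT / 64; w++)
        if (blockmap[first_word + w] != 0)
            return false;   /* some block in this extent is in use */
    return true;            /* whole extent free: avoid writing here */
}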

If the filesystem is already using discards (mount option) then in 
practice those extents will also be unallocated by thin LVM.

So the filesystem knows which blocks (extents) will cause allocation, if 
it knows it is sitting on a thin device like that.

> <quote>
>  However, it does mean the filesystem must know the 'hidden geometry'
>  beneath its own blocks, so that it can know about stuff that won't 
> work
>  anymore.
> </quote>
> 
> I'm pretty sure this was explained to you a couple weeks ago: it's
> called "integration".

You dumb faced idiot. You know full well this information is already 
there. What are you trying to do here? Send me into the woods again?

For a long time hard disks have exposed their geometry data to us.

And filesystems can be created with geometry information (of a certain 
kind) in mind. Yes, these are creation flags.

But extent alignment is also a creation flag. The extent alignment, or
block size, does not suddenly change over time. Not that it should
matter that much in principle. But this information can simply be had.
It is no different from knowing the size of the block device to begin
with.

If the creation tools were LVM-aware (they don't have to be), the
administrator could easily SET these parameters without any interaction
with the block layer itself. They can already do this for flags such as:

stride=stride-size
     Configure the filesystem for a RAID array with stride-size
     filesystem blocks. This is the number of blocks read or written
     to disk before moving to the next disk.  This mostly affects
     placement of filesystem metadata like bitmaps at mke2fs(8) time
     to avoid placing them on a single disk, which can hurt
     performance.  It may also be used by the block allocator.

stripe_width=stripe-width
     Configure the filesystem for a RAID array with stripe-width
     filesystem blocks per stripe. This is typically stride-size * N,
     where N is the number of data disks in the RAID (e.g. RAID 5 N+1,
     RAID 6 N+2).  This allows the block allocator to prevent
     read-modify-write of the parity in a RAID stripe if possible when
     the data is written.
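Both are one-time creation parameters, for example (device and numbers
illustrative):

mke2fs -t ext4 -E stride=16,stripe_width=64 /dev/md0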

And LVM extent size is not going to be any different. Zdenek explained 
earlier:

> However, what is being implemented is better 'allocation' logic for pool
> chunk provisioning (for XFS ATM) - as the rather 'dated' methods for
> deciding where to store incoming data do not apply efficiently to
> provisioned chunks.

> i.e. it's inefficient to provision 1M thin-pool chunks when the
> filesystem uses just 1/2 of a provisioned chunk and allocates the next
> one. The smaller the chunk is, the better the space efficiency gets
> (needed with snapshots), but it may need lots of metadata and may
> cause fragmentation troubles.

Geometry data has always been part of block device drivers and I am 
sorry I cannot do better at this point (finding the required information 
on code interfaces is hard):

struct hd_geometry {
     unsigned char heads;
     unsigned char sectors;
     unsigned short cylinders;
     unsigned long start;
};

Block devices also register block size, probably for buffers and write 
queues:

> static int bs = 512;
> module_param(bs, int, S_IRUGO);
> MODULE_PARM_DESC(bs, "Block size (in bytes)");
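For what it's worth, both the legacy geometry and the block sizes can
be read from userspace through standard ioctls - a minimal sketch
(device path is an example, error handling trimmed):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>   /* HDIO_GETGEO, struct hd_geometry */
#include <linux/fs.h>      /* BLKSSZGET, BLKGETSIZE64 */

int main(void)
{
    struct hd_geometry geo;
    int sector = 0;
    unsigned long long bytes = 0;
    int fd = open("/dev/vg00/thinlv", O_RDONLY);

    if (fd < 0)
        return 1;
    if (ioctl(fd, HDIO_GETGEO, &geo) == 0)      /* legacy geometry */
        printf("heads=%u sectors=%u cylinders=%u start=%lu\n",
               geo.heads, geo.sectors, geo.cylinders, geo.start);
    ioctl(fd, BLKSSZGET, &sector);              /* logical sector size */
    ioctl(fd, BLKGETSIZE64, &bytes);            /* device size in bytes */
    printf("sector=%d size=%llu\n", sector, bytes);
    close(fd);
    return 0;
}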

You know more about the system than I do, and yet you say these stupid 
things.

> For Read/Write alignment still the physical geometry is the limiting 
> factor.

Extent alignment can be another parameter, and I think Zdenek explains 
that the ext and XFS guys are already working on improving efficiency 
based on that.


These are parameters supplied by the administrator (or his/her tools). 
They are not dynamic communications from the block layer, but can be set 
at creation time.

However, the "partial read-only" mode I proposed is not even a 
filesystem parameter, but something that would be communicated by a 
kernel module to the required filesystem. (Driver!). NOT through its 
block interface, but from the outside.

No different from a remount ro. Not even much different from a umount.

And I am saying these things now, I guess, because there was no support 
for a more detailed, more fully functioning solution.


> For 50 years filesystems were DELIBERATELY
> written to be agnostic if not outright ignorant of the underlying
> block device's peculiarities. That's how modular software is written.
> Sure, some optimizations have been made by peeking into attributes
> exposed by the block layer but those attributes don't change over
> time. They are probed at newfs() time and never consulted again.

LVM extent size for a LV is also not going to change over time.

The only other thing that was mentioned was for a filesystem-aware 
kernel module to send a message to a filesystem (driver) to change its 
mode of operation. Not directly through the inter-layer communication. 
But from the outside. Much like perhaps tune2fs could, or something 
similar. But this time with a function call.


> Chafing at the inherent tradeoffs caused by "lack of knowledge" was
> why BTRFS and ZFS were written. It is ignorant to keep pounding the
> "but I want XFS/EXT+LVM to have feature parity with BTRFS" drum. It's not
> supposed to, it was never intended and it will never happen. So go use
> the tool as it's designed or go use something else that tickles your
> fancy.

What is going to happen or not is not for you to decide. You have no say 
in the matter whatsoever, if all you do is bitch about what other people 
do, but you don't do anything yourself.

Also you have no business ordering people around here, I believe, unless 
you are some super powerful or important person, which I really doubt 
you are.

People in general in Linux have this tendency to boss basically everyone 
else around.

Mostly that bossing around is exactly the form you use here "do this, or 
don't do that". As if they have any say in the lives of other people.


> <quote>
>  Will mention that I still haven't tested --errorwhenfull yet.
> </quote>
> 
> But you conveniently overlook the fact that the FS is NOT remotely
> full using any of the standard tools - all of a sudden the FS got
> signaled that the block layer was denying write BIO calls. Maybe
> there's a helpful kern.err in syslog that you wrote support for?

Oh, how cynical we are again. You are so very lovely, I instantly want 
to marry you.

You know full well I am still in the "designing" stages. And you are 
trying to cut short design by saying or implying that only 
implementation matters, thereby trying to destroy the design phase that 
is happening now, ensuring that no implementation will ever arise.

So you are not sincere at all and your incessant remarks about needing 
implementation and code are just vile attacks trying to prevent 
implementation and code from ever arising in full.

And this you do constantly here. So why do you do it? Do you believe 
that you cannot trust the maintainers of this product to make sane 
choices in the face of something stupid? Or are you really afraid of 
sane things because you know that if they get expressed, they might make 
it to the program which you don't like?

I think it is one or the other, but both look bad on you.

Either you have no confidence in the maintainers making the choices that 
are right for them, or you are afraid of choices that would actually 
improve things (but perhaps to your detriment, I don't know).

So what are you trying to fight here? Your own insanity? :P.

You conveniently overlook the fact that in current conditions, what you 
say just above is ALREADY TRUE. THE FILE SYSTEM IS NOT FULL GIVEN 
STANDARD TOOLS AND THE SYSTEM FREEZES DEAD. THAT DOES NOT CHANGE HERE 
except the freezing part.

I mean, what gives. You are now criticising a solution that allows us to 
live beyond death, when otherwise death would occur. But, it is not 
perfect enough for you, so you prefer a hard reboot over a system that 
keeps functioning in the face of some numbers no longer adding up?????? 
Or maybe I read you wrong here and you would like a solution, but you 
don't think this is it.

I have heard very few solutions from your side though, in those weeks 
past.

The only thing you ever mentioned back then was some shell scripting
stuff, if I remember correctly.


> <quote>
>  In principle if you had the means to acquire such a
> flag/state/condition, and the
>  filesystem would be able to block new  allocation wherever whenever,
> you would already
>  have a working system.  So what is then non-trivial?
> ...
>  It seems completely obvious to me at this point that if anything from
>  LVM (or e.g. dmeventd) could signal every filesystem on every affected
>  thin volume to enter a do-not-allocate state, and filesystems would be
>  able to fail writes based on that, you would already have a solution
> </quote>
> 
> And so therefore, in order to acquire this "signal", every write has to
> be done in synchronous fashion, with strict data integrity maintained
> vis-a-vis filesystem data and metadata. Kernel dirty block sizes and
> flush intervals are knobs that can be turned to "signal" user-land
> that write errors are happening. There's no such thing as "immediate"
> unless you use synchronous function calls from userland.

I'm sorry, you know a lot, but you have mentioned such "hints" before:
tweaking existing functionality for things it was not meant for.

Why are you trying to seek solutions within the bounds of the existing?
They can never work. You are basically trying to create that
"integration" you so despise without actively saying you are doing so;
instead, you seek hidden agendas, devious schemes, to communicate the
same thing without changing those interfaces. You are trying to do the
same thing, but you are just not owning up to it.

No, the signal would be something calling an existing (or new) system 
function in the filesystem driver from the (presiding) (LVM) module (or 
kernel part). In fact, you would not directly call the filesystem 
driver, probably you would call the VFS which would call the filesystem 
driver.

Just a function call.

I am talking about this thing:

struct super_operations {
         void (*write_super_lockfs) (struct super_block *);
         void (*unlockfs) (struct super_block *);
         int (*remount_fs) (struct super_block *, int *, char *);
         void (*umount_begin) (struct super_block *);
};
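For illustration only - the following callbacks exist in no kernel; this
is purely a hypothetical sketch of the shape such an out-of-band signal
could take, delivered through the VFS the same way remount_fs and
umount_begin are:

/* HYPOTHETICAL sketch - not an existing interface */
struct super_operations_proposed {
        /* thin pool nearly full: from now on fail any NEW block
         * allocation with ENOSPC, but keep serving reads and
         * overwrites of already-allocated blocks */
        void (*no_allocate_begin) (struct super_block *);
        /* pool was extended: resume normal allocation */
        void (*no_allocate_end) (struct super_block *);
};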

Something could be done around there. I'm sorry I haven't found the
relevant parts yet. My foot is hurting and I put some cream on it, but
it kind of disrupts my concentration here.

I have an infected and swollen foot, every day now.

No bacterial infection. A failed operation.

Sowwy.


> If you want to write your application to handle "mis-behaved" block
> layers, then use O_DIRECT+O_SYNC.

You are trying to do the complete opposite of what I'm trying to do, 
aren't you.


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-24 14:28             ` Gionatan Danti
@ 2016-05-24 17:17               ` Zdenek Kabelac
  0 siblings, 0 replies; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-24 17:17 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Zdenek Kabelac

On 24.5.2016 16:28, Gionatan Danti wrote:
> On 24-05-2016 16:17 Zdenek Kabelac wrote:
>>
>>
>> Dmeventd should not talk to lvmetad at all - I've been saying this for years....
>>
>> There are some very, very hard to fix (IMHO) design issues - and
>> locking lvmetad in memory would be just one of the wrong (IMHO) ways
>> forward....
>>
>> Anyway - let's see how it evolves here as there are further troubles
>> with lvmetad & dmeventd - see i.e. here:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1339210
>
> I'll follow it ;)
>
>> One more, somewhat related thing: when a thin pool goes full, it is a
>> good thing to remount an ext3/4 in readonly mode (errors=remount-ro).
>> But what to do with XFS which, AFAIK, does not support a similar
>> readonly-on-error policy?
>>
>> It is my understanding that upstream XFS has some improvements to
>> auto-shutdown in case of write errors. Did these improvements already
>> trickle down to production kernels (e.g. RHEL6 and 7)?
>
> Any thoughts/suggestions on that?


Surely they are pushed ASAP once the tests pass (the upstream-first
policy fully applies here).

So yes, 6.8 surely has improvements.

Regards


Zdenek


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-24 14:17           ` Zdenek Kabelac
@ 2016-05-24 14:28             ` Gionatan Danti
  2016-05-24 17:17               ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Gionatan Danti @ 2016-05-24 14:28 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Zdenek Kabelac

On 24-05-2016 16:17 Zdenek Kabelac wrote:
> 
> 
> Dmeventd should not talk to lvmetad at all - I've been saying this for
> years....
> 
> There are some very, very hard to fix (IMHO) design issues - and
> locking lvmetad in memory would be just one of the wrong (IMHO) ways
> forward....
> 
> Anyway - let's see how it evolves here as there are further troubles
> with lvmetad & dmeventd - see i.e. here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1339210

I'll follow it ;)

> One more, somewhat related thing: when a thin pool goes full, it is a
> good thing to remount an ext3/4 in readonly mode (errors=remount-ro).
> But what to do with XFS which, AFAIK, does not support a similar
> readonly-on-error policy?
> 
> It is my understanding that upstream XFS has some improvements to
> auto-shutdown in case of write errors. Did these improvements already
> trickle down to production kernels (e.g. RHEL6 and 7)?

Any thoughts/suggestions on that?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-24 13:45         ` Gionatan Danti
@ 2016-05-24 14:17           ` Zdenek Kabelac
  2016-05-24 14:28             ` Gionatan Danti
  0 siblings, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-24 14:17 UTC (permalink / raw)
  To: linux-lvm

On 24.5.2016 15:45, Gionatan Danti wrote:
> On 18-05-2016 15:47 Gionatan Danti wrote:
>>
>> One question: I did some test (on another machine), deliberately
>> killing/stopping the lvmetad service/socket. When the pool was almost
>> full, the following entry was logged in /var/log/messages
>>
>> WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
>>
>> So it appears that when lvmetad is gracefully stopped/not running,
>> dmeventd correctly resorts to device scanning. On the other hand, in
>> the previous case, lvmetad was running but returned "Connection
>> refused". Should/could dmeventd resort to device scanning in this case
>> also?
>>
>> ...
>>
>> Very probable. So, after an LVM update, is it best practice to restart
>> the machine or at least the dmeventd/lvmetad services?
>>
>> One more, somewhat related thing: when a thin pool goes full, it is a
>> good thing to remount an ext3/4 in readonly mode (errors=remount-ro).
>> But what to do with XFS which, AFAIK, does not support a similar
>> readonly-on-error policy?
>>
>> It is my understanding that upstream XFS has some improvements to
>> auto-shutdown in case of write errors. Did these improvements already
>> trickle down to production kernels (e.g. RHEL6 and 7)?
>>
>> Thanks.
>
> Sorry for the bump, I would really like to know your opinions on the above


Dmeventd should not talk to lvmetad at all - I've been saying this for years....

There are some very, very hard to fix (IMHO) design issues - and locking
lvmetad in memory would be just one of the wrong (IMHO) ways forward....

Anyway - let's see how it evolves here as there are further troubles
with lvmetad & dmeventd - see i.e. here:

https://bugzilla.redhat.com/show_bug.cgi?id=1339210

Regards

Zdenek


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-18 13:47       ` Gionatan Danti
@ 2016-05-24 13:45         ` Gionatan Danti
  2016-05-24 14:17           ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Gionatan Danti @ 2016-05-24 13:45 UTC (permalink / raw)
  To: LVM general discussion and development

On 18-05-2016 15:47 Gionatan Danti wrote:
> 
> One question: I did some test (on another machine), deliberately
> killing/stopping the lvmetad service/socket. When the pool was almost
> full, the following entry was logged in /var/log/messages
> 
> WARNING: Failed to connect to lvmetad. Falling back to internal 
> scanning.
> 
> So it appears that when lvmetad is gracefully stopped/not running,
> dmeventd correctly resorts to device scanning. On the other hand, in
> the previous case, lvmetad was running but returned "Connection
> refused". Should/could dmeventd resort to device scanning in this case
> also?
> 
> ...
> 
> Very probable. So, after an LVM update, is it best practice to restart
> the machine or at least the dmeventd/lvmetad services?
> 
> One more, somewhat related thing: when a thin pool goes full, it is a
> good thing to remount an ext3/4 in readonly mode (errors=remount-ro).
> But what to do with XFS which, AFAIK, does not support a similar
> readonly-on-error policy?
> 
> It is my understanding that upstream XFS has some improvements to
> auto-shutdown in case of write errors. Did these improvements already
> trickle down to production kernels (e.g. RHEL6 and 7)?
> 
> Thanks.

Sorry for the bump, I would really like to know your opinions on the 
above remarks.
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 13:48     ` Zdenek Kabelac
@ 2016-05-18 13:47       ` Gionatan Danti
  2016-05-24 13:45         ` Gionatan Danti
  0 siblings, 1 reply; 23+ messages in thread
From: Gionatan Danti @ 2016-05-18 13:47 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: 'Gionatan Danti'

On 17/05/2016 15:48, Zdenek Kabelac wrote:
>
> Yes - in general - you've witnessed a general tool failure,
> and dmeventd is not 'smart' enough to recognize the reason for the failure.
>
> Normally this 'error' should not happen.
>
> And while I'd even say there could have been a 'shortcut'
> without even reading the VG 'metadata' - since there is profile support,
> it can't be known (100% threshold) without actually reading the metadata
> (so it's quite a tricky case anyway)
>

One question: I did some test (on another machine), deliberately 
killing/stopping the lvmetad service/socket. When the pool was almost 
full, the following entry was logged in /var/log/messages

WARNING: Failed to connect to lvmetad. Falling back to internal scanning.

So it appears that when lvmetad is gracefully stopped/not running,
dmeventd correctly resorts to device scanning. On the other hand, in the
previous case, lvmetad was running but returned "Connection refused".
Should/could dmeventd resort to device scanning in this case also?

>
>
> Assuming you've been bitten by this one:
>
> https://bugzilla.redhat.com/1334063
>
> possibly? Targeted by this commit:
>
> https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=7ef152c07290c79f47a64b0fc81975ae52554919
>

Very probable. So, after an LVM update, is it best practice to restart
the machine or at least the dmeventd/lvmetad services?

One more, somewhat related thing: when a thin pool goes full, it is a
good thing to remount an ext3/4 in readonly mode (errors=remount-ro).
But what to do with XFS which, AFAIK, does not support a similar
readonly-on-error policy?

It is my understanding that upstream XFS has some improvements to
auto-shutdown in case of write errors. Did these improvements already
trickle down to production kernels (e.g. RHEL6 and 7)?

Thanks.


-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-18  1:34                   ` Xen
@ 2016-05-18 12:15                     ` Zdenek Kabelac
  0 siblings, 0 replies; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-18 12:15 UTC (permalink / raw)
  To: LVM general discussion and development

On 18.5.2016 03:34, Xen wrote:
> Zdenek Kabelac wrote on 18-05-2016 0:26:
>> On 17.5.2016 22:43, Xen wrote:
>>> Zdenek Kabelac wrote on 17-05-2016 21:18:
>>>
>>> I don't know much about Grub, but I do know its lvm.c by heart now almost :p.
>>
>> lvm.c by grub is mostly useless...
>
> Then I feel we should take it out and not have grub capable of booting LVM
> volumes anymore at all, right.

It's not properly parsing and building lvm2 metadata - it's 'reverse
engineered' code that handles a couple of the 'most common' metadata
layouts.

But as it happens, most users are happy with it.

So for now, using a 'boot' partition is advised until a proper lvm2
metadata parser becomes an integral part of Grub.


>> ATM user needs to write his own monitoring plugin tool to switch to
>> read-only volumes - it's really as easy as running bash script in loop.....
>
> So you are saying every user of thin LVM must do this individually; that means
> if there are 10,000 users, you now have 10,000 people needing to write the same

Only very few of them will write something - and they may propose their 
scripts for upstream inclusion...

> I take it by that loop you mean a sleep loop. It might also be that logtail
> thing and then check for the dmeventd error messages in syslog. Right? And

dmeventd is also a 'sleep loop' in this sense (although smarter...)


> First hit is CentOS. Second link is reddit. Third link is Redhat. Okay it
> should be "lvm guide" not "lvm book". Hasn't been updated since 2006 and no
> advanced information other than how to compile and install....

Damned Google - it knows about you, that you like CentOS and reddit :)
I get quite a different set of links :)

> I mean: http://tldp.org/HOWTO/LVM-HOWTO/. So what people are really going to
> know this stuff except the ones that are on this list?

We do maintain man pages - not feeling responsible for any HOWTO/blogs around 
the world.

And of course you can learn a lot here as well:
https://access.redhat.com/documentation/en/red-hat-enterprise-linux/

>
> How to find out about vgchange -ay without having internet access.........

Now just imagine you needed to configure your network from the command
line with a broken NetworkManager package....

> maybe a decade or longer. Not as a developer mostly, as a user. And the thing
> is just a cynical place. I mean, LOOK at Jira:
>
> https://issues.apache.org/jira/browse/log4j2/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
>

Being cynical myself - I'm unsure what's better about the URL name
issues.apache.org compared to bugzilla.redhat.com... Obviously we do have all sorts of flags in RHBZ.

>> Well the question was not asking for your 'technical' proposal, as you
>> have no real idea how it works and your visions/estimations/guesses
>> have no use at all (trust me - far deeper thinking was considered so
>> don't even waste your time to write those sentences...)
>
> Well you can drop the attitude you know. If you were doing so great, you would
> not be having a total lack of all useful documentation to begin with. You
> would not have a system that can freeze the entire system by default, because
> "policy" is apparently not well done.

Yep - and you probably think you are helping us a lot to realize this...

But you may 'calm down' a bit - we really know all the troubles, and even
far more than you can think of - and surprise - we actively work on them.


> I think the commands themselves and their way of being used, is outstanding,
> they are intuitive, they are much better than many other systems out there
> (think mdadm). It takes hardly no pain to remember how to use e.g. lvcreate,

Design simply takes time - and many things are tried...

Of course Red Hat could have been cooking something secretly for 10 years
before going public - but the philosophy is: upstream first, release often,
and only released code matters.

So yeah - some people are writing novels on lists, and some others are
writing useful code....


> You are *already* integrating e.g. extfs to more closely honour the extent
> boundaries so that it is more efficient. What I am saying is not at all out of

There is a fundamental difference between 'reading' the geometry once at
'mkfs' time and doing it on every single write through the whole device stack ;)


>> When you fail to write an ordinary (non-thin) block device  - this
>> block is then usually 'unreadable/error' - but in thinLV case - upon
>> read you get previous 100% valid' content - so you may start to
>> imagine where it's all heading.
>
> So you mean that "unreadable/error" signifies some form of "bad sector" error.
> But if you fail to write to thinLV, doesn't that mean (in our case there) that
> the block was not allocated by thinLV? That means you cannot read from it
> either. Maybe bad example, I don't know.

I think we are heading to the big 'reveal' of how thinp works.

You have  thin volume  T  and its snapshot S.

You write to block 10 of device T.

As there is snapshot S, your write to device T needs to go to a newly
provisioned thin-pool chunk.

You get a 'write error' back (no more free chunks).

On read of block 10 you get the perfectly valid existing content of
block 10 (and this applies to both volumes, T & S).


And then you realize that this 'write of block 10' means you were just
updating some 'existing' file in the filesystem, or even the filesystem
journal. There was no 'new' block allocation at the filesystem level -
the filesystem was writing to 'space' it believed had already been
assigned to it.

So I assume maybe now some 'spark' in your head may finally appear....
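A rough reproduction sketch of the above (VG name, sizes and block
numbers are illustrative; with the default queue_if_no_space behaviour
the write error may only appear after a timeout):

lvcreate -L 8m -T vg/pool                    # tiny thin pool
lvcreate -V 8m -T vg/pool -n T               # thin volume T
dd if=/dev/zero of=/dev/vg/T bs=1M count=8 oflag=direct  # provision every chunk
lvcreate -s vg/T -n S                        # snapshot S: all chunks now shared
dd if=/dev/zero of=/dev/vg/T bs=64k count=1 seek=10 oflag=direct
    # -> write error: rewriting shared block 10 needs a fresh chunk, pool has none
dd if=/dev/vg/T of=/dev/null bs=64k count=1 skip=10
    # -> reads fine: the old content of block 10 is still perfectly valid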


> It is not either/or. What I was talking about is both. You have reliability
> and you can keep using the filesystem. The filesystem just needs to be able to
> cope with the condition that it cannot use any new blocks from the existing
> pool that it knows about. That is not extremely very different from having
> exhausted its block pool to begin with. It is really the same condition,
> except right now it is rather artificial.

Wondering how long it will take before you realize - this is exactly what
the 'threshold' is about.

e.g. you know you are 90% full - so stop using the fs - unmount it,
remount it, shut it down, add new space - whatever - but it needs to be
the admin who decides...
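For reference, the related threshold is a standard lvm.conf setting
(values here illustrative); with monitoring enabled, dmeventd grows the
pool once usage crosses it, provided the VG has free space:

activation {
    thin_pool_autoextend_threshold = 90
    thin_pool_autoextend_percent = 20
}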

....
deleted large piece of nonsense text here
....

>
> I mean I am still wholly unaware of how concurrency works in the kernel
> (except that I know the terms) (because I've been reading some code) (such as
> RCU, refcount, spinlock, mutex, what else) but I doubt this would be a real
> issue if you did it right, but that's just me.

You need to read some books on how a modern OS works (instead of
creating hour-long emails) and learn what it really means to have
'parallel work' in progress on a single machine with e.g. 128 CPU cores...

> If you can concurrently traverse data structures and keep everything working
> in pristine order, you know, why shouldn't you be able to 'concurrently'
> update a number.

What you effectively say here is that you have 'invented' an excellent
bug fix - you just need to serialize and synchronize all writes in your OS first.

To give a 'real world' example - you would need to degrade your Linux
kernel to not use the page cache and issue all writes in a way like:

dd if=XXX of=/my/thin/volume  bs=512 oflag=direct,sync

> Maybe that's stupid of me, but it just doesn't make sense to me.

see above...

But as said - this is the way it worked in the 'msdos' era of 198X...

> Then you can say "Oh I give up", but still, it does not make much sense.

My only goal here is to give you enough info that you stop writing
emails with no real value in them, and rather write more useful code or
docs instead...

>> 'extX' will switch to  'ro'  upon write failure (when configured this way).
>
> Ah, you mean errors=remount-ro. Let me see what my default is :p. (The man
> page does not mention the default, very nice....).
>
> Oh, it is continue by default. Obvious....

The common issue here is that one user prefers A, another prefers B - that's
why we have options, and users should read the docs - as the tools themselves
are not smart enough to figure out which fits better....

If you ask me, 'remount,ro' is the only sane variant.
I learned this the 'hard way' with my first failing HDD in ~199X,
where I destroyed 50% of my data first....
(I do believe in Fedora you get remount,ro in fstab)
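For reference, the fstab form of that policy looks like this (device and
mountpoint are examples):

/dev/vg00/thinlv  /data  ext4  defaults,errors=remount-ro  0 2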


> But a bash loop is no solution for a real system.....

Sure if you write this loop in JBoss  it sounds way more cool :)
Whatever fits...


>>> That would normally mean that filesystem operations such as DELETE would still
>>
>> You really need to sit and think for a while what the snapshot and COW
>> does really mean, and what is all written into a filesystem  (included
>> with journal) when you delete a file.
>
> Too tired now. I don't think deleting files requires growth of filesystem. I
> can delete files on a full fs just fine.
>
> You mean a deletion on origin can cause allocation on snapshot.

It's not the 'snapshot' that allocates - it's always the thin volume you
write to...

You must not 'rewrite' a chunk referenced by multiple thin volumes.

It's the 'key' difference between old snapshot & thin-provisioning.

With the old snapshots, blocks were first copied into the many 'snapshots'
(crippling write performance in a major way) and then your origin was updated.

With thins, the referenced block is kept in place and a new chunk is allocated.

So this should quickly lead you to the conclusion that ANY write in the 'fs'
may cause allocation...

Anyway - I've tried hard to 'explain', and if I've still failed, I'm not a
good 'teacher' and there is no reason to continue this debate.

Regards

Zdenek


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
       [not found] <766997897.3926921.1463545271031.JavaMail.yahoo.ref@mail.yahoo.com>
@ 2016-05-18  4:21 ` matthew patton
  0 siblings, 0 replies; 23+ messages in thread
From: matthew patton @ 2016-05-18  4:21 UTC (permalink / raw)
  To: LVM general discussion and development

Xen wrote:

> > ATM user needs to write his own monitoring plugin tool to switch to
> > read-only volumes - it's really as easy as running bash script in 
> > loop.....
 
> needing to write the same thing, while first having to acquire the knowledge
> of how to do it.

Well, Xen, immortalize your name in perpetuity by providing this bash script so it can be included in the tool set. Write fewer hundred-line emails, and instead write more code for peer review.
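For what it's worth, a minimal sketch of such a script (VG/pool/mountpoint
names and the threshold are illustrative, and this is untested - a starting
point, not a reviewed tool):

#!/bin/bash
VG=vg00 POOL=pool MNT=/srv/thin LIMIT=95
while sleep 60; do
    # data_percent is the pool fill level as reported by lvs
    pct=$(lvs --noheadings -o data_percent "$VG/$POOL" | tr -d ' ')
    [ -n "$pct" ] || continue
    if [ "${pct%%.*}" -ge "$LIMIT" ]; then
        logger -p kern.err "thin pool $VG/$POOL at ${pct}% - remounting $MNT read-only"
        mount -o remount,ro "$MNT"
        break
    fi
done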


> I would probably be more than happy to write documentation at some point 

Except you can't be bothered to read source... Or spend the time to write the aforementioned script which will necessarily teach you a thing or two about the inner workings and events with their outcomes. When you put in the effort the community will no doubt appreciate your newly written documentation. Hop to it, eh?


> Debian/Ubuntu systems automatically activating the vg upon opening a 
> LUKS container, but then the OpenSUSE rescue environment not doing that.

Umm, "rescue environments" are deliberately designed to be more careful about what they do. Are you comparing a OpenSUSE rescue to a Deb/Ubuntu rescue? Or to a Deb "normal" environment? How long have you been around Linux that you are surprised that very different distributions might have different "defaults"?

> How to find out about vgchange -ay without having internet  access.........

man (8) vgchange perhaps? man -k lvm?
 
> So for me it has been a hard road to begin with and I am still learning.

Then maybe less bitching, more experimenting, more practical experience, more man-page reading, more knowledge retention, some B12 supplements perhaps? And stick to the bloody topic - nobody cares that you're put off by Bugzilla's sparse interface.
 


* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 22:26                 ` Zdenek Kabelac
@ 2016-05-18  1:34                   ` Xen
  2016-05-18 12:15                     ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Xen @ 2016-05-18  1:34 UTC (permalink / raw)
  To: LVM general discussion and development

Zdenek Kabelac wrote on 18-05-2016 0:26:
> On 17.5.2016 22:43, Xen wrote:
>> Zdenek Kabelac wrote on 17-05-2016 21:18:
>> 
>> I don't know much about Grub, but I do know its lvm.c by heart now 
>> almost :p.
> 
> lvm.c by grub is mostly useless...

Then I feel we should take it out and not have grub capable of booting 
LVM volumes anymore at all, right.

>> One of the things I don't think people would disagree with would be 
>> having one
>> of either of:
>> 
>> - autoextend and waiting with writes so nothing fails
>> - no autoextend and making stuff read-only.
> 
> ATM user needs to write his own monitoring plugin tool to switch to
> read-only volumes - it's really as easy as running bash script in 
> loop.....

So you are saying every user of thin LVM must do this individually; that
means if there are 10,000 users, you now have 10,000 people needing to
write the same thing, while first having to acquire the knowledge of how
to do it.

I take it by that loop you mean a sleep loop. It might also be that 
logtail thing and then check for the dmeventd error messages in syslog. 
Right? And then when you find this message, you remount ro. You have to 
test a bit to make sure it works and then you are up and running. But 
this does imply that this thing is only available to die-hard users. You 
first have to be aware of what is going to happen. I tell you, there is 
really not a lot of good documentation on LVM okay. I know there is that 
LVM book. Let me get it....

First hit is CentOS. Second link is reddit. Third link is Redhat. Okay 
it should be "lvm guide" not "lvm book". Hasn't been updated since 2006 
and no advanced information other than how to compile and install....

I mean: http://tldp.org/HOWTO/LVM-HOWTO/. So what people are really 
going to know this stuff except the ones that are on this list?

Unless you experiment, you won't know what will happen to begin with. 
For instance, different topic, but it was impossible to find any real 
information on LVM cache.

So now you want every single admin to have the knowledge (that you
obviously do have, but you are its writers and maintainers, its gods and
cohorts) to create a manual script, no matter how simple, that will
check the syslog - which you can only really know about by checking the
fucking source or running tests and then seeing what happens (and being
smart enough to check syslog) -- and then of course to write either a
service file for this script or put it in some form of rc.local.

Well that latter is easy enough even on my system (I was not even sure 
whether that existed here :p).

But knowing about this stuff doesn't come by itself. You know. This 
doesn't just fall from the sky.

I would probably be more than happy to write documentation at some point 
(because I guess I did go through all of that to learn, and maybe others 
shouldn't or won't have to?) but without this documentation, or this 
person leading the way, this is not easy stuff.

Also "info" still sucks on Linux, the only really available resource 
that is easy to use are man pages. It took me quite some time to learn 
about all the available lvm commands to begin with (without reading a 
encompassing manual) and imagine my horror when I was used to 
Debian/Ubuntu systems automatically activating the vg upon opening a 
LUKS container, but then the OpenSUSE rescue environment not doing that.

How to find out about vgchange -ay without having internet 
access.........

It was impossible.

So for me it has been a hard road to begin with and I am still learning.

In fact I *had* read about vgchange -ay but that was months prior and I 
had forgotten. Yes, bad sysadmin.

Every piece of effort a user has to make on his own is a piece of effort
that could have been prevented by a developer, or even possibly by a
(documentation) writer, if such a thing could exist. And I know I can't
do it yet, if that is what you are asking or thinking.


> We call them 'Request For Enhancements' BZ....

You mean you have a non-special non-category that only distinguishes 
itself by having a [RFE] tag in the bug name, and that is your special 
feature? (laughs a bit).

I mean I'm not saying it has to be anything special and if you have a 
small system maybe that is enough.

But Bugzilla is just not an agreeable space to really inspire or invite 
positive feedback like that.... I mean I too have been using bugzillas 
for maybe a decade or longer. Not as a developer mostly, as a user. And 
the thing is just a cynical place. I mean, LOOK at Jira:

https://issues.apache.org/jira/browse/log4j2/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel

Just an example. A "bug" is just one out of many categories. They have 
issue types for Improvements, Brainstorming, New Feature, Question, 
Story, and Wish. It is so entirely inviting to do whatever you want to 
do. In BugZilla, a feature request is still just a bug. And in your 
RedHat system, you just have added some field called "doc type" that 
you've set to "enhancement" but that's it.

And a bug is a failure, it is a fault. The system is not meant for 
positive feedback, only negative feedback in that sense. The user 
experience of it is just vastly detrimental compared to that other 
thing....

Well I didn't really want to go into this, but since you invited it 
:pp....

But it is also meant for the coming thing. And I apologize.




>> First what I proposed would be for every thin volume to have a spare 
>> chunk.
>> But maybe that's irrelevant here.
> 
> Well the question was not asking for your 'technical' proposal, as you
> have no real idea how it works and your visions/estimations/guesses
> have no use at all (trust me - far deeper thinking was considered so
> don't even waste your time to write those sentences...)

Well you can drop the attitude you know. If you were doing so great, you 
would not be having a total lack of all useful documentation to begin 
with. You would not have a system that can freeze the entire system by 
default, because "policy" is apparently not well done.

You would not be having to debate how to make the system even a little 
bit safer, and excuse yourself every three lines by saying that it's the 
admin's job to monitor his system, not your job to make sure he doesn't 
need to do all that much, or your job to make sure the system is 
fail-safe to begin with.

I mean I understand that it is a work in progress. But then don't act 
like it is finished, or that it is perfect provided the administrator is 
perfect too.

If I'm trying to do anything here, it is to point out that the system is 
quite lacking by default. You say "policy, policy, policy" as though you 
are very tired. And maybe I'm a bit less so, I don't know. And I know it 
can be tiresome to have to make these... call them fine-tunings to
make sure they work well by default on every system. Especially, I don't 
know. If it is a work in progress and not meant to be used by people not 
willing to invest as much as you have (so to speak).

And I'm not saying you are doing a bad job in developing this. I think 
LVM is one of the more sane systems existing in the Linux world today. I 
mean, I wouldn't be here if I didn't like it, or if I wasn't grateful 
for your work.

I think the commands themselves and their way of being used, is 
outstanding, they are intuitive, they are much better than many other 
systems out there (think mdadm). It takes hardly any pain to remember how
to use e.g. lvcreate, or vgcreate, or whatever. It is intuitive, it is 
nice, sometimes you need a little lookup, and that is fast too. It is a 
bliss to use compared to other systems certainly. Many of the 
rudimentary things are possible, and the system is so nicely modular and 
layered that it is always obvious what you need to do at whatever point.


> Also forget about writing a new FS - a thinLV is a block device, so there
> is no such thing as the 'fs allocating' space on the device - this space
> is meant to be there....

In this case, provided indeed none of that would happen (that we talked
about earlier), the filesystem doesn't NEED to allocate anything, but it
DOES know which parts of the block space it already has in use and which
parts it doesn't. If it is aware of this, and if it is aware of the
"real block size" of the underlying device - provided that device does a
form of allocation (as thin LVM does) - then suddenly it doesn't NEED to
know about this allocation other than to know that it is happening, and
it only needs to know the alignment of the real blocks.

Of course that means some knowledge of the underlying device, but as
has been said earlier (by that other guy who supported it), this
knowledge is already there at some level and it would not be that weird.

Yes it is that "integration" you so despise.

You are *already* integrating e.g. extfs to more closely honour the 
extent boundaries so that it is more efficient. What I am saying is not 
at all out of the ordinary with that. You could not optimize if the 
filesystem did not know about alignment, and if it could not "direct" 
'allocation' into those aligned areas. So the filesystem already knows 
what is going to happen down beneath, and it has the knowledge to choose 
not to write to new areas unless it has to. You *told* me so.

That means it can also choose not to write to any NEW "aligned" blocks.

So you are just standing on principle here. You attack the idea based on the
fact that "there is no real allocation taking place of the block device 
by the filesystem". But if you drop the word, there is no reason to 
disagree with what I said.

The filesystem KNOWS allocation is getting done (or it could know) and 
if it knows about the block alignment of those extents, then it does not 
NEED to have intimate knowledge of the ACTUAL allocation getting done by 
the thin volume in the thin pool.

So what are you really disagreeing with here? You are just being 
pedantic right? You could tell the filesystem to enter 
no-allocation-mode or no-write-to-new-areas-mode (same thing here) or 
"no-cause-allocation-mode" (same thing here).

And it would work.

Even if you disagree with the term, it would still work. At least, as 
far as we go here.

You never said it wouldn't work. You just disagreed with my use of 
wording.



> Rather think in terms:
> 
> You have 2 thinLVs.
> 
> Origin + snapshot.
> 
> You write to origin - and you miss to write a block.
> 
> Such block may be located in  'fs' journal, it might be a 'data' block,
> or fs metadata block.
> 
> Each case may have different consequences.

But that is for the filesystem to decide. The thin volume will not know 
about the filesystem. In that sense. Layers, remember?


> When you fail to write an ordinary (non-thin) block device  - this
> block is then usually 'unreadable/error' - but in thinLV case - upon
> read you get previous 100% valid' content - so you may start to
> imagine where it's all heading.

So you mean that "unreadable/error" signifies some form of "bad sector" 
error. But if you fail to write to thinLV, doesn't that mean (in our 
case there) that the block was not allocated by thinLV? That means you 
cannot read from it either. Maybe bad example, I don't know.


> Basically solving these troubles when pool is 'full' is 'too late'.
> If user wants something 'reliable'  - he needs to use different 
> thresholds -
> i.e. stopping at 90%....

Well I will try to look into it more when I have time. But I don't 
believe you. I don't see a reason from the outset why it should or would 
need to be so. There should be no reason a write fails unless an 
allocate fails. So how could you ever read from it (unless you read 
random or white data). And, provided the filesystem does try to read 
from it; why would it do so if its write failed before that?

Maybe that is what you alluded to before, but a filesystem should be 
able to solve that on its own without knowing those details I think. I 
believe quite usually inodes are written in advance? They are not 
growth-scenarios. So this metadata cannot fail to write due to a failed 
block level allocate. But even that should be irrelevant for thin LVM 
itself.....


> But other users might be 'happy' with missing block (failing write
> area) and rather continue to use 'fs'....

But now you are talking about human users. You are now talking about an 
individual that tries to write to a thin LV, it doesn't work because the 
thing is full, and he/she wants to continue to use the 'fs'. But that is 
what I proposed right. If you have a fail-safe system, if you have a 
system that keeps functioning even though it blocks growth writes, then 
you have the best of both worlds. You have both.

It is not either/or. What I was talking about is both. You have 
reliability and you can keep using the filesystem. The filesystem just 
needs to be able to cope with the condition that it cannot use any new 
blocks from the existing pool that it knows about. That is not extremely 
very different from having exhausted its block pool to begin with. It is 
really the same condition, except right now it is rather artificial.

You artificially tell the FS: you are out of space. Or, you may not use 
new (alignment) blocks. It is no different from having no free blocks at 
all. The FS could deal with it in the same way.


> 
> You have many things to consider - but if you make policies too 
> complex,
> users will not be able to use it.
> 
> Users are already confused with 'simple' lvm.conf options like
> 'issue_discards'....

I understand. But that is why you create reasonable defaults that work 
well together. I mean, I am not telling you you can't, or have done a 
bad job in the past, or are doing a bad job now.

But I'm talking mostly about defaults. And right now I was really only 
proposing this idea of a filesystem state that says "Me, the filesystem, 
will not allocate any new blocks for data that are in alignment with the 
underlying block device. I will not use any new (extents) from my block 
device even though normally they would be available to me. I have just 
been told there might be an issue, and even though I don't know why, I 
will just accept that and try not to write there anymore".

It is really the simplest idea there can be here. If you didn't have 
thin, and the filesystem was full, you'd have the same condition.

It is just a "stop expanding" flag.


>> Personally, I feel the condition of a filesystem getting into a "cannot
>> allocate" state, is superior.
> 
> As said - there is no thin-volume filesystem.

Can you just cut that, you know. I know the filesystem does not 
allocate. But it does know, or can know, allocation will happen. It 
might be aware of the "thin" nature, and even if it didn't, it could 
still honour such a flag even if it wouldn't make sense for it.


>> However in this case it needs no other information. It is just a 
>> state. It
>> knows: my block devices has 4M blocks (for instance), I cannot get new 
>> ones
> 
> Your thinking is from 'msdos' era - single process, single user.
> 
> You have multiple thin volumes active, with multiple different users
> all running their jobs in parallel and you do not want to stop every
> user when you are recomputing space in pool.
> 
> There is really no much point in explaining further details unless you 
> are
> willing to spend your time understanding deeply surrounding details.

You are using details to escape the necessity that the overlying or 
encompassing framework dictates that things do currently not work.

That is like using the trees to say that there is no forest.

Or not seeing the forest for the trees. That is exactly what it means. I 
know I am a child here. But do not ignore the wisdom of a child. The 
child knows more than you do. Even if it has much less data than you do.

The whole reason a child *can* know more is because it has less data. 
Because of that, it can still see the outline, while you may no longer 
be able to, because you are deep within the forest.

That's exactly what that saying means.

If you see planet earth from space and you see that it is turning or 
maybe you can see its ice caps are melting. And then someone on earth 
says "No that is not happening because such and such is so". Who is 
right? The one with the overview, or the one with the details?

An outsider can often perceive directly what is the nature of something. 
Only at the outside, of course. But he/she can clearly see whether it is 
left or right, big or small, cold or hot. It may not know why it is 
being hot or cold, but it does know that it is being cold or hot. And 
the outsider may see there should be no reason why something cannot be 
so.

If details are in the way, change the details.

By the above, with "user" you seem to mean a real human user. But a 
filesystem queues requests, it does not have multiple users. It needs to 
schedule whatever it is doing, but it all has to go through the same 
channel, ending up on the same disk. So from this perspective, the only 
relevant users are the various filesystems. This must be so, because if 
two operating systems mount the same block device twice, you get mayhem. 
So the filesystem driver is the channel. Whether it is one multitasking 
process or multiple users doing the same thing, is irrelevant. Jobs, in 
this sense, are also irrelevant. What is relevant is writes to different 
parts, or reads from different parts.

But supposing those multiple users are multiple filesystems using the 
same thin pool. Okay you have a point, perhaps. And indeed I do not know 
about any delays in space calculations. I am just approaching this from 
the perspective of a designer. I would not design it such that the data 
on the amount of free extents would at any one time be unavailable. It 
should be available to all at any time. It is just a number. It does 
not, or should not, need recomputation. I am sorry if that is incorrect 
here. If it does need recomputation, then of course what you say makes 
sense (even to me), and you need a time window to prepare for 
disaster; to anticipate.

I don't see why a value like the number of free extents in a pool would 
need recomputation though, but that is just me. Even if you had 
concurrent writes (allocations/expansions) you should be able to deal 
with that, people do that all the time.

The number of free extents is simply a given at any one time right? 
Unless freeing them is a more involved operation. I'm just trying to 
show you that there shouldn't need to be any problems here with this 
idea.

Allocations should be atomic, and even if they are concurrent, the 
updating of this information shouldn't need to be: it is a single 
number, and only one writer can change it at a time. Even if you wrote 
10 million blocks concurrently, your system should be able to increment 
that number 10 million times in the same time.

Right? I know you will say wrong. But this seems extraordinarily 
strange to me.

I mean, I am still wholly unaware of how concurrency works in the 
kernel, except that I know the terms (RCU, refcounts, spinlocks, 
mutexes, and so on) because I've been reading some code. But I doubt 
this would be a real issue if you did it right; then again, that's just 
me.

If you can concurrently traverse data structures and keep everything 
working in pristine order, you know, why shouldn't you be able to 
'concurrently' update a number.

Maybe that's stupid of me, but it just doesn't make sense to me.





>> That seems pretty trivial. The mechanic for it may not. It is 
>> preferable in my
>> view if the filesystem was notified about it and would not even *try* 
>> to write
> 
> There is no 'try' operation.

You have watched Star Wars too much. That statement is misunderstood; 
Yoda tells a falsehood there.

There is a write operation that can fail or not fail.


> It would probably O^2 complicate everything - and the performance would
> drop by a major factor - as you would need to handle cancellation....

Can you only think in troubles and worries? :P. I see you mean (I 
think) that some writes would succeed and some would fail, and that that 
would complicate things? Other than that, there is not much difference 
from a read-only filesystem, right?

A filesystem that cannot even write to any new blocks is dead anyway. 
Why worry about performance in that case? It's a form of read-only mode 
or space-full mode that is not very different from existing modes. It's 
a single flag. Some writes succeed, some writes fail. The system is 
almost dead to begin with, space is gone. Applications start to crash 
left and right. But at least the system survives.

Not sure what cancellation you are talking about or if you understood 
what I said before.....

> For simplicity here - just think of a failing 'thin' write as a disk
> with 'write' errors where, upon read, you get the last written
> content....

So? And I still cannot see how that would happen. If the filesystem had 
not actually written to a certain area, it would also not try to read, 
right? Otherwise, the whole idea of "lazy allocation" of extents is 
impossible. I don't actually know what happens if you "read" the entire 
thin LV, and you could, but blocks that have never been allocated (by 
thin LV) should just return zero. I don't think anything else would 
happen?

I mean, there we go again: And of course the file contains nothing but 
zeroes, duh. Reading from a "nonwritten" extent just returns zero space. 
Obvious.
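
That is easy enough to verify, by the way (device name is an example):

    # read the first MiB of a thin LV that has never been written;
    # unprovisioned chunks read back as zeroes, without being allocated
    dd if=/dev/vg/thinvol bs=1M count=1 | hexdump -C | head -n 3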

There is no reason why a thin write should fail if it has succeeded 
before to the same area. I mean, what's the issue here, you don't really 
explain. Anyway I am grateful for your time explaining this, but it just 
does not make much sense.

Then you can say "Oh I give up", but still, it does not make much sense.


> 'extX' will switch to  'ro'  upon write failure (when configured this 
> way).

Ah, you mean errors=remount-ro. Let me see what my default is :p. (The 
man page does not mention the default, very nice....).

Oh, it is continue by default. Obvious....
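
For what it's worth, checking what a given ext volume is configured to 
do is easy (device name is an example):

    # show the configured behavior on fs errors for an ext2/3/4 volume
    tune2fs -l /dev/vg/thinvol | grep -i 'errors behavior'
    # Errors behavior:          Continue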

In any case, that means that if there were a 3rd mount option type 
(after rw and ro..... say "rp", for "read/partial" ;-)), it could also 
remount rp on errors ;-).

Thanks for the pointers all.

> 'XFS' in 'most' cases now will shut itself down as well (being improved)
> 
> extX is better since user may still continue to use it at least in
> read-only mode...

Thanks. That is very welcome. But I need to be a complete expert to be 
able to use this thing. I will write a manual later :p. (If I'm still 
alive).


>> It seems completely obvious to me at this point that, if anything from 
>> LVM (or
>> e.g. dmeventd) could signal every filesystem on every affected thin 
>> volume, to
>> enter a do-not-allocate state, and filesystems would be able to fail 
>> writes
>> based on that, you would already have a solution right?
> 
> 'bash' loop...

I guess your --errorwhenfull y, combined with tune2fs -e remount-ro, 
would also do the trick, but that works on ALL filesystem errors.
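
Something like this, I suppose (names are examples):

    # make the pool error immediately instead of queueing the writes
    lvchange --errorwhenfull y vg/pool
    # make ext4 go read-only on the first resulting write error
    tune2fs -e remount-ro /dev/vg/thinvol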

Like I said, I haven't tested it yet. Maybe we are covering nonsensical 
ground here.

But a bash loop is no solution for a real system.....

Yes thanks for pointing it out to me. But this email is getting way too 
long for me.

Anyway, we are also converging on the solution I'd like, so thank you 
for your time here regardless.

> Remember - we are not writing a 'new' fs....

Never said I was. New state for existing fs.


> You are preparing for a lost battle.
> A full pool is simply not a full fs.
> And a thin-pool may run out of data space or out of metadata space....

It does not have to be any different from when the filesystem thinks and 
says it is full.

You are not going from full pool to full filesystem. The filesystem is 
not even full.

You are going from full pool, to a message to filesystems to enter 
no-expand-mode (no-allocate-mode), which will then simply cease growing 
into new "aligned" blocks.

What does it even MEAN to say that the two are not identical? I never 
talked about the two being identical. It is just an expansion freeze.


>> That would normally mean that filesystem operations such as DELETE 
>> would still
> 
> You really need to sit and think for a while about what snapshots and
> COW really mean, and what all gets written into a filesystem (including
> the journal) when you delete a file.

Too tired now. I don't think deleting files requires growth of the 
filesystem. I can delete files on a full fs just fine.

You mean a deletion on origin can cause allocation on snapshot.

Still that is not a filesystem thing, that is a thin-pool thing.

That is something for LVM to handle. I don't think this delete would 
fail, would it? If the snapshot is a block thing, it could write the 
changed inodes of the file and its directory.... it would only overwrite 
the actual data if that block was overwritten on origin.

So you run the risk of extent allocation for inodes.

But you have this problem today as well. It means clearing space could 
possibly need or would possibly need a work buffer. Some workspace.

You would need to pre-allocate space for the snapshot, as a practical 
measure. But that's not really a real solution.

The real solution is to buffer it in memory. If the deletes free space, 
you get free extents that you can use to write the memory buffered data 
(metadata). That's the only way to deal with that. You are just talking 
inodes (and possibly journal).

(But then how is the snapshot going to know these are deletes? In any 
case, you'd have the same problems with regular writes to the origin. So 
I guess with snapshots you run into more trouble.)

I guess with snapshots you either drop the snapshots or freeze the 
entire filesystem/volume? Then how will you delete anything?

You would either have to drop a snapshot, drop a thin volume, or copy 
the data first and then do that.

Right?

Too tired.

> But one of our 'policy' visions is to also use 'fstrim' when some
> threshold is reached or before a thin snapshot is taken...


A filesystem mounted with "discard" will automatically do that, right, 
with a slight delay, so to speak.

I guess it would be good to do that, or to warn the user to mount with 
the "discard" option.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 20:43               ` Xen
@ 2016-05-17 22:26                 ` Zdenek Kabelac
  2016-05-18  1:34                   ` Xen
  0 siblings, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-17 22:26 UTC (permalink / raw)
  To: Xen, LVM general discussion and development

On 17.5.2016 22:43, Xen wrote:
> Zdenek Kabelac wrote on 17-05-2016 21:18:
>
> I don't know much about Grub, but I do know its lvm.c by heart now almost :p.

lvm.c by grub is mostly useless...

> One of the things I don't think people would disagree with would be having one
> of either of:
>
> - autoextend and waiting with writes so nothing fails
> - no autoextend and making stuff read-only.

ATM the user needs to write his own monitoring plugin/tool to switch
volumes to read-only - it's really as easy as running a bash script in a loop.....
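
A minimal sketch of such a loop (pool name, mountpoint and threshold are 
just example values):

    #!/bin/bash
    # poll the thin-pool fill level; above 95% switch the fs read-only
    # vg/pool and /mnt/thinvol are example names
    while sleep 10; do
        pct=$(lvs --noheadings -o data_percent vg/pool | tr -d ' ')
        [ -z "$pct" ] && continue
        if [ "${pct%%.*}" -ge 95 ]; then
            mount -o remount,ro /mnt/thinvol
            break
        fi
    done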


> Alright. BugZilla is just for me not very amenable to /positive changes/, it
> seems so much geared towards /negative bugs/ if you know what I mean. Myself I
> would like to use more of Jira (Atlassian) but I did not say that ;-).

We call them 'Request For Enhancements' BZ....

>> To shed some 'light' on where the 'core of the problem' is:
>>
>> Imagine you have a few thin LVs,
>> and you operate on a single one - which is almost fully provisioned
>> and just a single chunk needs to be provisioned.
>> And you fail to write.  It's really nontrivial to decide what needs
>> to happen.
>
> First what I proposed would be for every thin volume to have a spare chunk.
> But maybe that's irrelevant here.

Well, the question was not asking for your 'technical' proposal, as you have no 
real idea how it works, and your visions/estimations/guesses are of no use at all 
(trust me - far deeper thinking was considered, so don't even waste your time 
writing those sentences...)

Also, forget writing a new FS - a thinLV is a block device, so there is no such 
thing as the 'fs allocating' space on the device - this space is meant to be there....

> When you say "it is nontrivial to decide what needs to happen" what you mean
> is: what should happen to the other volumes in conjunction to the one that
> just failed a write (allocation).

Rather think in terms:

You have 2 thinLVs.

Origin + snapshot.

You write to origin - and you fail to write a block.

Such block may be located in  'fs' journal, it might be a 'data' block,
or fs metadata block.

Each case may have different consequences.

When you fail to write to an ordinary (non-thin) block device - this block is 
then usually 'unreadable/error' - but in the thinLV case - upon read you get 
the previous '100% valid' content - so you may start to imagine where it's all heading.

Basically solving these troubles when pool is 'full' is 'too late'.
If user wants something 'reliable'  - he needs to use different thresholds -
i.e. stopping at 90%....

But other users might be 'happy' with missing block (failing write area) and 
rather continue to use 'fs'....

You have many things to consider - but if you make policies too complex,
users will not be able to use them.

Users are already confused with 'simple' lvm.conf options like 
'issue_discards'....
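
(For reference - it lives in the devices section of lvm.conf, and only 
controls whether LVM itself discards a PV's blocks when an LV is removed 
or reduced:)

    # lvm.conf - unrelated to the fs-level 'discard' mount option
    devices {
        issue_discards = 1
    }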


> Personally, I feel the condition of a filesystem getting into a "cannot
> allocate" state, is superior.

As said - there is no thin-volume filesystem.


> However in this case it needs no other information. It is just a state. It
> knows: my block device has 4M blocks (for instance), I cannot get new ones

Your thinking is from 'msdos' era - single process, single user.

You have multiple thin volumes active, with multiple different users all 
running their jobs in parallel and you do not want to stop every user when you 
are recomputing space in pool.

There is really not much point in explaining further details unless you are
willing to spend your time understanding the surrounding details deeply.


> * In your example, the last block of the entire thin pool is now gone
> * In your example, no other thin LV can get new blocks (extents, chunks)
> * In your example, all thin LVs would need to start blocking writes to new
> chunks in case there is no autoextend, or possibly delay them if there is.
>
> That seems pretty trivial. The mechanic for it may not. It is preferable in my
> view if the filesystem was notified about it and would not even *try* to write

There is no 'try' operation.

It would probably O^2 complicate everything - and the performance would
drop by a major factor - as you would need to handle cancellation....

> new blocks anymore. Then, it can immediately signal userspace processes
> (programs) about writes starting to fail.

For simplicity here - just think of a failing 'thin' write as a disk with 
'write' errors where, upon read, you get the last written content....

>
> Will mention that I still haven't tested --errorwhenfull yet.
>
> But this solution does seem to indicate you would need to either get all
> filesystems to either plainly block all new allocations, or be smart about it.
> Doesn't make a big difference.

'extX' will switch to  'ro'  upon write failure (when configured this way).

'XFS' in 'most' cases now will shut itself down as well (being improved)

extX is better since user may still continue to use it at least in read-only 
mode...


> It seems completely obvious to me at this point that, if anything from LVM (or
> e.g. dmeventd) could signal every filesystem on every affected thin volume, to
> enter a do-not-allocate state, and filesystems would be able to fail writes
> based on that, you would already have a solution right?

'bash' loop...


> It would be a special kind of read-only. It would basically be a third state,
> after read-only, and read-write.

Remember - we are not writing a 'new' fs....

>
> But it would need to be something that can take effect NOW. It would be a kind
> of degraded state. Some kind of emergency flag that says: sorry, certain
> things are going to bug out now. If the filesystem is very smart, it might
> still work for a while as old blocks are getting filled. If not, new
> allocations will fail and writes will ....somewhat randomly start to fail.

You are preparing for a lost battle.
A full pool is simply not a full fs.
And a thin-pool may run out of data space or out of metadata space....

>
> That would normally mean that filesystem operations such as DELETE would still

You really need to sit and think for a while about what snapshots and COW 
really mean, and what all gets written into a filesystem (including the 
journal) when you delete a file.

> work, ie. you keep a running system on which you can remove files and make space.
>
> That seems to be about as graceful as it can get. Right? Am I wrong?

Wrong...

But one of our 'policy' visions is to also use 'fstrim' when some threshold 
is reached or before a thin snapshot is taken...


Z.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 19:18             ` Zdenek Kabelac
@ 2016-05-17 20:43               ` Xen
  2016-05-17 22:26                 ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Xen @ 2016-05-17 20:43 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Zdenek Kabelac

Zdenek Kabelac wrote on 17-05-2016 21:18:

> The message behind is - booting from 'linear' LVs, and no msdos 
> partitions...
> So right from a PV.
> Grub giving you 'menu' from bootable LVs...
> BootableLV combined with selected 'rootLV'...

I get it.

If that is the vision, I'm completely fine with that. I imagine everyone 
would. That would be rather nice.

I'm not that much of a snapshot person, but still, there is nothing 
really against it.

Andrei Borzenkov once told me on OpenSUSE list that there just (wasn't) 
support for thin yet at all in grub at that point (maybe a year ago that 
was?).

As I said I was working on an old patch to enable grub booting of PVs, 
but Andrei hasn't been responsive for more than a week. Maybe I'm just 
not very keen on all of this.

I don't know much about Grub, but I do know its lvm.c by heart now 
almost :p.

So yeah, anyway.

>> In my test, the thin volumes were created on another harddisk. I 
>> created a
>> small partition, put a thin pool in it, put 3 thin volumes in it, and 
>> then
>> overfilled it to test what would happen.
> 
> It's the very same issue if you'd used a 'slow' USB device - you
> may slow down the whole linux usage - or, in a similar way, building a
> 4G .iso image.
> 
> My advice - try lowering  /proc/sys/vm/dirty_ratio -   I'm using 
> '5'....

Yeah yeah, slow down. I first have to test the immediate-failure, 
no-waiting switch.


> Policies are hard and it's not quite easy to have some universal one
> that fits everyone's needs here.

It depends on what people say they want.

In principle I don't think people would disagree with certain solutions 
if that was default.

One of the things I don't think people would disagree with would be 
having one of either of:

- autoextend and waiting with writes so nothing fails
- no autoextend and making stuff read-only.

I don't really think there are any other use cases. But like I 
indicated, any advanced system would only error on "growth writes".
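
For the autoextend case, the knobs that already exist in lvm.conf would 
look like this, I believe (the values are just examples):

    # lvm.conf: grow the pool by 20% whenever it crosses 80% full
    # (dmeventd monitoring must be enabled for this to trigger)
    activation {
        thin_pool_autoextend_threshold = 80
        thin_pool_autoextend_percent = 20
    }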

> On the other hand it's relatively easy to write some 'tooling' for your
> particular needs - if you have nice 'walled' garden you could easily
> target it...

Sure, and that's how every universal solution starts. But sometimes 
people just need to be convinced, and sometimes they need to be 
convinced by seeing a working system and tests or statistics of whatever 
kind.


>> "Monitoring" and "stop using" is a process or mechanism that may very 
>> well be
>> encoded and be made default, at least for my own systems, but by 
>> extension, if
>> it works for me, maybe others can benefit as well.
> 
> Yes - this part will be extended and improved over time.
> A few BZs already exist...
> It just takes time....

Alright. BugZilla is just for me not very amenable to /positive 
changes/, it seems so much geared towards /negative bugs/ if you know 
what I mean. Myself I would like to use more of Jira (Atlassian) but I 
did not say that ;-).



> Plain simplicity - umount is a simple sys call, while 'mount -o
> remount,ro' is a relatively complicated, resource-consuming process.
> There are some technical limitations related to running operations like
> this behind 'dmeventd' - so it needs some redesigning for these new
> needs....

Okay. I thought it would be equivalent because both are invoked not as 
a system call, but by actually loading /bin/umount.

I guess that might mean you would need to trigger yet another process, 
but you seem to be on top of it.

I would probably just blatantly get another daemon running, but I don't 
really have the skills for this yet. (I'm just approaching it from a 
quick & dirty perspective, as soon as I can get it running, at least I 
have a test system, proof of concept, or something that works).

> To shed some 'light' on where the 'core of the problem' is:
> 
> Imagine you have a few thin LVs,
> and you operate on a single one - which is almost fully provisioned
> and just a single chunk needs to be provisioned.
> And you fail to write.  It's really nontrivial to decide what needs
> to happen.

First what I proposed would be for every thin volume to have a spare 
chunk. But maybe that's irrelevant here.

So there are two different cases as mentioned: existing block writes, 
and new block writes. What I was gabbing about earlier would be forcing 
a filesystem to also be able to distinguish between them. You would 
have a filesystem-level "no extend" mode or "no allocate" mode that gets 
triggered. Initially my thought was to have this get triggered through 
the FS-LVM interface. But, it could also be made operational not through 
any membrane but simply by having a kernel (module) that gets passed 
this information. In both cases the idea is to say: the filesystem can 
do what it wants with existing blocks, but it cannot get new ones.

When you say "it is nontrivial to decide what needs to happen" what you 
mean is: what should happen to the other volumes in conjunction to the 
one that just failed a write (allocation).

To begin with, this is a problem situation, so programs or system calls 
erroring out is expected and desirable, right.

So there are only three, four, five different cases:

- kernel informs VFS that all writes to all thin volumes should fail
- kernel informs VFS that all writes to new blocks on thin volumes 
should fail (not sure if it can know this)
- filesystem gets notified that new block allocation is not going to 
work, deal with it
- filesystem gets notified that all writes should cease (remount ro, in 
essence), deal with it.

Personally, I prefer the 3rd of these four.

Personally, I feel the condition of a filesystem getting into a "cannot 
allocate" state, is superior.

That would be a very powerful feature. Earlier I talked about all of 
this communication between the block layer and the filesystem layer 
right. But in this case it is just one flag, and it doesn't have to 
traverse the block-FS barrier.

However, it does mean the filesystem must know the 'hidden geometry' 
beneath its own blocks, so that it can know about stuff that won't work 
anymore.

However in this case it needs no other information. It is just a state. 
It knows: my block device has 4M blocks (for instance), I cannot get 
new ones (or if I try, mayhem can ensue) and now I just need to 
indiscriminately fail writes that would require new blocks, try to 
redirect them to existing ones, let all existing-block writes continue 
as usual, and overall just fail a lot of stuff that would require new 
room.

Then of course your applications are still going to fail but that is the 
whole point. I'm not sure if the benefit is that outstanding as opposed 
to complete read-only, but it is very clear:



* In your example, the last block of the entire thin pool is now gone
* In your example, no other thin LV can get new blocks (extents, chunks)
* In your example, all thin LVs would need to start blocking writes to 
new chunks in case there is no autoextend, or possibly delay them if 
there is.

That seems pretty trivial. The mechanic for it may not. It is preferable 
in my view if the filesystem was notified about it and would not even 
*try* to write new blocks anymore. Then, it can immediately signal 
userspace processes (programs) about writes starting to fail.

Will mention that I still haven't tested --errorwhenfull yet.

But this solution does seem to indicate you would need to get all 
filesystems to either plainly block all new allocations, or be smart 
about it. Doesn't make a big difference.

In principle if you had the means to acquire such a 
flag/state/condition, and the filesystem would be able to block new 
allocation wherever whenever, you would already have a working system. 
So what is then non-trivial?


The only case that is really nontrivial is if you have autoextend. 
But even that you already have implemented.

It seems completely obvious to me at this point that, if anything from 
LVM (or e.g. dmeventd) could signal every filesystem on every affected 
thin volume, to enter a do-not-allocate state, and filesystems would be 
able to fail writes based on that, you would already have a solution 
right?

It would be a special kind of read-only. It would basically be a third 
state, after read-only, and read-write.

But it would need to be something that can take effect NOW. It would be 
a kind of degraded state. Some kind of emergency flag that says: sorry, 
certain things are going to bug out now. If the filesystem is very 
smart, it might still work for a while as old blocks are getting filled. 
If not, new allocations will fail and writes will ....somewhat randomly 
start to fail.

Certain things might continue working, others may not. Most applications 
would need to deal with that by themselves, which would normally have to 
be the case anyway. I.e., applications all over the field may start to 
fail. But that is what you want, right? That is the only sensible thing.

If you have no autoextend.

That would normally mean that filesystem operations such as DELETE would 
still work, ie. you keep a running system on which you can remove files 
and make space.

That seems to be about as graceful as it can get. Right? Am I wrong?



>> Maybe that should be the default for any system that does not have 
>> autoextend
>> configured.
> 
> Yep policies, policies, policies....

Sounds like you could use a nice vacation in a bubble bath with nice 
champagne and good lighting, maybe a scented room, and no work for at 
least a week ;-).

And maybe some lovely ladies ;-) :P.

Personally I don't have the time for that, but I wouldn't say no to the 
ladies tbh.



Anyway, let me just first test --errorwhenfull for you, or at least for 
myself, to see if that completely solves the issue I had, okay.

Regards and thanks for responding,

B.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 17:17           ` Xen
@ 2016-05-17 19:18             ` Zdenek Kabelac
  2016-05-17 20:43               ` Xen
  0 siblings, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-17 19:18 UTC (permalink / raw)
  To: LVM general discussion and development

On 17.5.2016 19:17, Xen wrote:
> Strange, I didn't get my own message.
>
>
> Zdenek Kabelac wrote on 17-05-2016 11:43:
>
>> There is no plan ATM to support boot from thinLV in nearby future.
>> Just use small boot partition - it's the safest variant - it just hold
>> kernels and ramdisks...
>
> That's not what I meant. Grub-probe will fail when the root filesystem is on
> thin, thereby making impossible the regeneration of your grub config files in
> /boot/grub.
>
> It will try to find the device for mounted /, and not succeed.
>
> Booting thin root is perfectly possible, ever since Kubuntu 14.10 at least (at
> least January 2015).
>
>> We aim for a system with boot from single 'linear' with individual
>> kernel + ramdisk.
>>
>> It's simple, efficient and can be easily achieved with existing
>> tooling with some 'minor' improvements in dracut to easily allow
>> selection of system to be used with given kernel as you may prefer to
>> boot different thin snapshot of your root volume.
>
> Sure but won't happen if grub-update bugs on thin root.
>
> I'm not sure why we are talking about this now, or what I asked ;-).
>

The message behind is - booting from 'linear' LVs, and no msdos partitions...
So right from a PV.
Grub giving you 'menu' from bootable LVs...
BootableLV combined with selected 'rootLV'...

>> Complexity of booting right from thin is very high with no obvious benefit.
>
> I understand. I had not even been trying to achieve that yet, although it has or
> might have a principal benefit, the way doing away with partitions entirely
> (either msdos or gpt) has a benefit on its own.
>
> But as you indicate, you can place boot on non-thin LVM just fine, so there is
> not really that issue as you say.
>
>>> But for me, a frozen volume would be vastly superior to the system locking up.
>>
>> You miss the knowledge how the operating system works.
>>
>> Your binary is  'mmap'-ed from a device. When the device holding binary
>> freezes, your binary may freeze (unless it is mlocked in memory).
>>
>> So advice here is simple - if you want to run unfreezable system -
>> simply do not run this from a thin-volume.
>
> I did not run from a thin-volume, that's the point.
>
> In my test, the thin volumes were created on another harddisk. I created a
> small partition, put a thin pool in it, put 3 thin volumes in it, and then
> overfilled it to test what would happen.

It's the very same issue if you'd used a 'slow' USB device - you may slow 
down the whole linux usage - or, in a similar way, building a 4G .iso image.

My advice - try lowering  /proc/sys/vm/dirty_ratio -   I'm using '5'....
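
I.e. (either form works):

    # keep at most 5% of RAM dirty, so a stalled thin device cannot
    # accumulate a huge writeback backlog
    sysctl -w vm.dirty_ratio=5
    # or:
    echo 5 > /proc/sys/vm/dirty_ratio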


>> The best advice we have - 'monitor' fullness - when it's above - stop
>> using such system and ensure there will be more space -  there is
>> noone else to do this task for you - it's the price you pay for
>> overprovisioning.
>
> The point is that not only as an admin (for my local systems) but also as a
> developer, there is no point in continuing a situation that could be mitigated
> by designing tools for this purpose.
>
> There is no point for me if I can make this easier by automating tools for
> performing these tasks, instead of doing them by hand. If I can create tools
> or processes that do, what I would otherwise have needed to do by hand, then
> there is no point in continuing to do it by hand. That is the whole point of
> "automation" everywhere.

Policies are hard and it's not quite easy to have some universal one
that fits everyone's needs here.

On the other hand it's relatively easy to write some 'tooling' for your
particular needs - if you have nice 'walled' garden you could easily target it...

> "Monitoring" and "stop using" is a process or mechanism that may very well be
> encoded and be made default, at least for my own systems, but by extension, if
> it works for me, maybe others can benefit as well.

Yes - this part will be extended and improved over time.
A few BZs already exist...
It just takes time....

>
> I am not clear why a forced lazy umount is better, but I am sure you have your
> reason for it. It just seems that in many cases, an unwritable but present
> (and accessible) filesystem is preferable to none at all.

Plain simplicity - umount is a simple sys call, while 'mount -o remount,ro' is 
a relatively complicated, resource-consuming process.  There are some technical 
limitations related to running operations like this behind 'dmeventd' - so it 
needs some redesigning for these new needs....


> I do not mean any form of differentiation or distinction. I mean an overall
> forced read only mode on all files, or at least all "growing", for the entire
> volume (or filesystem on it) which would pretty much be the equivalent of
> remount,ro. The only distinction you could ever possibly want in there is to
> block "new growth" writes while allowing writes to existing blocks. That is
> the only meaningful distinction I can think of.
>
> Of course, it would be pretty much equivalent to a standard mount -o
> remount,ro, and would still depend on thin pool information.
>

To shed some 'light' on where the 'core of the problem' is:

Imagine you have a few thin LVs,
and you operate on a single one - which is almost fully provisioned
and just a single chunk needs to be provisioned.
And you fail to write.  It's really nontrivial to decide what needs
to happen.

>> Worth to note here - you can set your thin-pool with 'instant'
>> erroring in case you know you do not plan to resize it (avoiding
>> 'freeze')
>>
>> lvcreate/lvchange --errorwhenfull  y|n
>
> Ah thank you, that could solve it. I will try again with the thin test the
> moment I feel like rebooting again. The harddrive is still available, haven't
> installed my system yet.
>
> Maybe that should be the default for any system that does not have autoextend
> configured.

Yep policies, policies, policies....

Regards

Zdenek

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17  9:43         ` Zdenek Kabelac
@ 2016-05-17 17:17           ` Xen
  2016-05-17 19:18             ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Xen @ 2016-05-17 17:17 UTC (permalink / raw)
  To: LVM general discussion and development

Strange, I didn't get my own message.


Zdenek Kabelac wrote on 17-05-2016 11:43:

> There is no plan ATM to support boot from thinLV in nearby future.
> Just use small boot partition - it's the safest variant - it just holds
> kernels and ramdisks...

That's not what I meant. Grub-probe will fail when the root filesystem 
is on thin, thereby making impossible the regeneration of your grub 
config files in /boot/grub.

It will try to find the device for mounted /, and not succeed.

Booting thin root is perfectly possible, ever since Kubuntu 14.10 at 
least (at least January 2015).

> We aim for a system with boot from single 'linear' with individual
> kernel + ramdisk.
> 
> It's simple, efficient and can be easily achieved with existing
> tooling with some 'minor' improvements in dracut to easily allow
> selection of system to be used with given kernel as you may prefer to
> boot different thin snapshot of your root volume.

Sure but won't happen if grub-update bugs on thin root.

I'm not sure why we are talking about this now, or what I asked ;-).

> Complexity of booting right from thin is very high with no obvious 
> benefit.

I understand. I had not even been trying to achieve that yet, although 
it has or might have a principal benefit, the way doing away with 
partitions entirely (either msdos or gpt) has a benefit on its own.

But as you indicate, you can place boot on non-thin LVM just fine, so 
there is not really that issue as you say.

>> But for me, a frozen volume would be vastly superior to the system 
>> locking up.
> 
> You miss the knowledge how the operating system works.
> 
> Your binary is  'mmap'-ed from a device. When the device holding binary
> freezes, your binary may freeze (unless it is mlocked in memory).
> 
> So advice here is simple - if you want to run unfreezable system -
> simply do not run this from a thin-volume.

I did not run from a thin-volume, that's the point.

In my test, the thin volumes were created on another harddisk. I created 
a small partition, put a thin pool in it, put 3 thin volumes in it, and 
then overfilled it to test what would happen.

At first nothing happened, but as I tried to read back from the volume 
that had supposedly been written to, the entire system froze. My system 
had no active partitions on that harddisk other than those 3 thin 
volumes.
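
(For reference, the setup was essentially the following; the names and 
sizes here are illustrative, not the exact ones I used:)

    # a small pool, three thin volumes overprovisioned against it
    vgcreate vg /dev/sdb1
    lvcreate -L 1G --thinpool pool vg
    for i in 1 2 3; do
        lvcreate -V 1G --thin -n thin$i vg/pool
        mkfs.ext4 /dev/vg/thin$i
    done
    mount /dev/vg/thin1 /mnt/test
    dd if=/dev/zero of=/mnt/test/fill bs=1M   # runs the pool to 100%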

> ATM there are some 'black holes', as filesystems were not deeply tested
> in all corner cases which can now be 'easily' hit with thin usage.
> This is getting improved - but the advice "DO NOT run a thin-pool to
> 100%" still applies.

I understand.

> The best advice we have - 'monitor' fullness - when it's above, stop
> using such a system and ensure there will be more space -  there is
> no one else to do this task for you - it's the price you pay for
> overprovisioning.

The point is that not only as an admin (for my local systems) but also 
as a developer, there is no point in continuing a situation that could 
be mitigated by designing tools for this purpose.

There is no point for me if I can make this easier by automating tools 
for performing these tasks, instead of doing them by hand. If I can 
create tools or processes that do, what I would otherwise have needed to 
do by hand, then there is no point in continuing to do it by hand. That 
is the whole point of "automation" everywhere.

I am not going to be a martyr just for the sake of people saying that a 
real admin would do everything by himself, by hand, by never sleeping 
and setting alarm clocks every hour to check on his system, if you know 
what I mean.

"Monitoring" and "stop using" is a process or mechanism that may very 
well be encoded and be made default, at least for my own systems, but by 
extension, if it works for me, maybe others can benefit as well.

I see no reason for remaining a spartan if I can use code to solve it as 
well.

Just the fact that auto-unmount and auto-extend exists, means you do not 
disagree with this.

Regards.




> If you need something 'urgently' now  -  you could e.g. monitor your 
> syslog
> messages for 'dmeventd' reports and run e.g. 'reboot' in some cases...

Well I guess I will just try to find time to develop that applet/widget 
I mentioned.

Of course an automated mechanism would be nice. The issue is not 
filesystem corruption. The issue is my system freezing entirely. I'd 
like to prevent that. Meaning, if I were to change the thin dmeventd 
module, to remount ro, it would probably already be solved for me, if I 
recompile and can use the compiled version.

I am not clear why a forced lazy umount is better, but I am sure you 
have your reason for it. It just seems that in many cases, an unwritable 
but present (and accessible) filesystem is preferable to none at all.


> or instead of reboot   'mount -o remount,ro' - whatever fits...
> Just be aware that a relatively 'small' load on the filesystem may 
> easily provision
> a major portion of the thin-pool quickly.

Depending on size of pool, right. It remains a race against the clock.


>> Maybe it would even be possible to have a kernel module that blocks a 
>> certain
>> kind of writes, but these things are hard, because the kernel doesn't 
>> have a
>> lot of places to hook onto by design. You could simply give the 
>> filesystem (or
>> actually the code calling for a write) write failures back.
> 
> There are no multiple write queues at the dm level where you could
> select that you want to store data from LibreOffice but throw out
> your Firefox files...

I do not mean any form of differentiation or distinction. I mean an 
overall forced read only mode on all files, or at least all "growing", 
for the entire volume (or filesystem on it) which would pretty much be 
the equivalent of remount,ro. The only distinction you could ever 
possibly want in there is to block "new growth" writes while allowing 
writes to existing blocks. That is the only meaningful distinction I can 
think of.

Of course, it would be pretty much equivalent to a standard mount -o 
remount,ro, and would still depend on thin pool information.


> dmeventd is quite quick when it 'detects' threshold (recent version of 
> lvm2).

Right.

> Your 'write' queue (amount of dirty pages) could simply be full of
> writes to a 'blocked' device, and without 'time-outing' writes (60sec)
> you can't write anything anywhere else...

Roger that, so it is really a resource issue. Currently I am running 
this here system off of a USB 2 stick. I can tell you. IO blocking 
happens more than sunrays bouncing off walls in my house, and they do 
that a lot, too.

Something as simple as "man command" may block the system for 10 seconds 
or more. Oftentimes everything stops responding. I can see the USB 
stick working. And then after a while the system resumes as normal. I 
have a read speed of 25MB/s but something is amiss with IO scheduling.


> Worth to note here - you can set your thin-pool with 'instant'
> erroring in case you know you do not plan to resize it (avoiding
> 'freeze')
> 
> lvcreate/lvchange --errorwhenfull  y|n

Ah thank you, that could solve it. I will try again with the thin test 
the moment I feel like rebooting again. The harddrive is still 
available, haven't installed my system yet.

Maybe that should be the default for any system that does not have 
autoextend configured.

Regards.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-17 13:09   ` Gionatan Danti
@ 2016-05-17 13:48     ` Zdenek Kabelac
  2016-05-18 13:47       ` Gionatan Danti
  0 siblings, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-17 13:48 UTC (permalink / raw)
  To: LVM general discussion and development

On 17.5.2016 15:09, Gionatan Danti wrote:
>
>
>> Well yeah - ATM we rather take 'early' action and try to stop any user
>> on an overfilled thin-pool.
>>
>
> It is a very reasonable standing
>
>>
>>
>> Basically whenever  'lvresize' failed - dmeventd plugin now tries
>> to unconditionally umount any associated thin-volume with
>> thin-pool above threshold.
>>
>>
>>
>> For now  -  plugin   'calls'  the tool - lvresize --use-policies.
>> If this tool FAILs for ANY  reason ->  umount will happen.
>>
>> I'll probably put in 'extra' test that 'umount' happens
>> with  >=95% values only.
>>
>> dmeventd  itself has no idea if there is configure 100 or less - it's
>> the lvresize to see it - so even if you set 100% - and you have enabled
>> monitoring  - you will get umount (but no resize)
>>
>>
>
> Ok, so the "failed to resize" error is also raised when no actual resize
> happens, but the call to the "dummy" lvresize fails. Right?

Yes - in general - you've witnessed a general tool failure,
and dmeventd is not 'smart' enough to recognize the reason for the failure.

Normally this 'error' should not happen.

And while I'd even say there could have been a 'shortcut'
without even reading the VG 'metadata' - since there is profile support,
it can't be known (the 100% threshold) without actually reading metadata
(so it's quite a tricky case anyway)

>>
>> Well 'lvmetad' shall not crash, ATM this may kill commands - and further
>> stop processing - as we rather 'stop' further usage rather than allowing
>> to cause bigger damage.
>>
>> So if you have unusual system/device setup causing  'lvmetad' crash -
>> open BZ,
>> and meawhile  set   'use_lvmetad=0' in your lvm.conf till the bug is fixed.
>>
>
> My 2 cents are that the last "yum upgrade", which affected the lvm tools,
> needed a system reboot or at least the restart of the lvm-related services
> (dmeventd and lvmetad). The strange thing is that, even if lvmetad crashed, it
> should be restartable via the lvm2-lvmetad.socket systemd unit. Is this a
> wrong expectation?


Assuming you've been bitten by this one:

https://bugzilla.redhat.com/1334063

possibly targeted by this commit:

https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=7ef152c07290c79f47a64b0fc81975ae52554919

Regards

Zdenek

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 12:08 ` Zdenek Kabelac
  2016-05-16 13:01   ` Xen
@ 2016-05-17 13:09   ` Gionatan Danti
  2016-05-17 13:48     ` Zdenek Kabelac
  1 sibling, 1 reply; 23+ messages in thread
From: Gionatan Danti @ 2016-05-17 13:09 UTC (permalink / raw)
  To: Zdenek Kabelac, LVM general discussion and development



> Well yeah - ATM we rather take 'early' action and try to stop any user
> on an overfilled thin-pool.
>

It is a very reasonable standing

>
>
> Basically whenever  'lvresize' failed - dmeventd plugin now tries
> to unconditionally umount any associated thin-volume with
> thin-pool above threshold.
>
>
>
> For now  -  plugin   'calls'  the tool - lvresize --use-policies.
> If this tool FAILs for ANY  reason ->  umount will happen.
>
> I'll probably put in 'extra' test that 'umount' happens
> with  >=95% values only.
>
> dmeventd  itself has no idea if there is configure 100 or less - it's
> the lvresize to see it - so even if you set 100% - and you have enabled
> monitoring  - you will get umount (but no resize)
>
>

Ok, so the "failed to resize" error is also raised when no actual resize 
happens, but the call to the "dummy" lvresize fails. Right?

>
> If you strictly don't care about any tracing of thin-pool fullness,
> disable  monitoring in lvm.conf.
>

While this thin pool should never be overfilled (it has a single, 
slightly smaller volume with no snapshot in place) I would really like 
to leave monitoring enabled, as it can prevent some nasty surprises (eg: 
avoid pool overfilling by a snapshot that is "forgotten" and never removed).

>
>
> Well 'lvmetad' shall not crash, ATM this may kill commands - and further
> stop processing - as we rather 'stop' further usage rather than allowing
> to cause bigger damage.
>
> So if you have unusual system/device setup causing  'lvmetad' crash -
> open BZ,
> and meawhile  set   'use_lvmetad=0' in your lvm.conf till the bug is fixed.
>

My 2 cents are that the last "yum upgrade", which affected the lvm 
tools, needed a system reboot or at least the restart of the lvm-related 
services (dmeventd and lvmetad). The strange thing is that, even if 
lvmetad crashed, it should be restartable via the lvm2-lvmetad.socket 
systemd unit. Is this a wrong expectation?

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 19:25       ` Xen
  2016-05-16 21:39         ` Xen
@ 2016-05-17  9:43         ` Zdenek Kabelac
  2016-05-17 17:17           ` Xen
  1 sibling, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-17  9:43 UTC (permalink / raw)
  To: LVM general discussion and development

On 16.5.2016 21:25, Xen wrote:
> Zdenek Kabelac wrote on 16-05-2016 16:09:
>
>> Behavior should be there for quite a while, but relatively recent fixes
>> in dmeventd have made it work more reliably in more circumstances.
>> I'd recommend to play at least with 142 - but since recent releases are
>> bugfix oriented - if you are compiling yourself - just take latest.
>
> I don't use my thin volumes for the system. That is difficult anyway because
> Grub doesn't allow it (although I may start writing for that at some point).

There is no plan ATM to support boot from thinLV in nearby future.
Just use small boot partition - it's the safest variant - it just holds kernels 
and ramdisks...

We aim for a system with boot from single 'linear' with individual kernel + 
ramdisk.

It's simple, efficient and can be easily achieved with existing tooling with 
some 'minor' improvements in dracut to easily allow selection of system to be 
used with given kernel as you may prefer to boot different thin snapshot of 
your root volume.

Complexity of booting right from thin is very high with no obvious benefit.

> But for me, a frozen volume would be vastly superior to the system locking up.

You miss the knowledge how the operating system works.

Your binary is  'mmap'-ed from a device. When the device holding binary 
freezes, your binary may freeze (unless it is mlocked in memory).

So advice here is simple - if you want to run unfreezable system - simply do 
not run this from a thin-volume.

> So while I was writing all of that ..material, I didn't realize that in my
> current system's state, the thing would actually cause the entire system to
> freeze. Not directly, but within a minute or so everything came to a halt.
> When I rebooted, all of the volumes were filled 100%, that is to say, all of
> the thin capacities added up to a 100% for the thin pool, and the pool itself
> was at 100%.
>
> I didn't check the condition of the filesystem. You would assume it would
> contain partially written files.

ATM there are some 'black holes', as filesystems were not deeply tested in all 
corner cases which can now be 'easily' hit with thin usage.
This is getting improved - but the advice "DO NOT run a thin-pool to 100%" still applies.


> If there is anything that would actually freeze the volume but not bring the
> system down, I would be most happy. But possibly it's the (ext) filesystem
> driver that makes trouble? Like we said, if there is no way to communicate
> space-fullness, what is it going to do right?

The best advice we have - 'monitor' fullness - when it's above, stop using 
such a system and ensure there will be more space -  there is no one else to do 
this task for you - it's the price you pay for overprovisioning.


> So is that dmeventd supposed to do anything to prevent disaster? Would I need
> to write my own plugin/configuration for it?

dmeventd only monitors and calls a command to try to resize, and may try to 
umount volumes in case disaster is approaching.

We plan to add more 'policy' logic - so you would be able to define what 
should happen when some fullness is reached - but that's just a plan ATM.

If you need something 'urgently' now  -  you could e.g. monitor your syslog
messages for 'dmeventd' reports and run e.g. 'reboot' in some cases...
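
A crude sketch of such a watcher (the log path and the exact dmeventd 
message text vary between distributions and lvm2 versions, so adjust 
the pattern to what your logs actually show):

    #!/bin/bash
    # reboot once dmeventd logs that the pool crossed the 95% threshold
    tail -Fn0 /var/log/messages |
    grep --line-buffered 'is now 9[59]\..*% full' |
    while read -r line; do
        /sbin/reboot
        break
    done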

> It is not running on my system currently. Without further amendments of course
> the only thing it could possibly do is to remount a filesystem read-only, like
> others have indicated it possibly already could.

or instead of reboot   'mount -o remount,ro' - whatever fits...
Just be aware that a relatively 'small' load on the filesystem may easily
provision a major portion of the thin-pool quickly.

>
> Maybe it would even be possible to have a kernel module that blocks a certain
> kind of writes, but these things are hard, because the kernel doesn't have a
> lot of places to hook onto by design. You could simply give the filesystem (or
> actually the code calling for a write) write failures back.

There are no multiple write queues at the dm level where you could select that 
you want to store data from LibreOffice but throw out your Firefox files...

> what other people have said. That the system already does this (mounting
> read-only).
>
> I believe my test system just failed because the writes only took a few
> seconds to fill up the volume. Not a very good test, sorry. I didn't realize
> that, that it would check only in intervals.

dmeventd is quite quick when it 'detects' threshold (recent version of lvm2).

> I still wonder what freezes my system like that.

Your 'write' queue (amount of dirty pages) could simply be full of writes to a 
'blocked' device, and without 'time-outing' writes (60sec) you can't write 
anything anywhere else...

Worth to note here - you can set your thin-pool with 'instant' erroring in 
case you know you do not plan to resize it (avoiding 'freeze')

lvcreate/lvchange --errorwhenfull  y|n
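
E.g. (pool name is an example):

    # fail writes needing new chunks immediately, instead of queueing
    # them for up to the 60s timeout
    lvchange --errorwhenfull y vg/pool
    lvs -o name,lv_when_full vg/pool   # shows 'error' vs 'queue'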


Regards

Zdenek

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 19:25       ` Xen
@ 2016-05-16 21:39         ` Xen
  2016-05-17  9:43         ` Zdenek Kabelac
  1 sibling, 0 replies; 23+ messages in thread
From: Xen @ 2016-05-16 21:39 UTC (permalink / raw)
  To: Linux lvm

In all likelihood there is just a delay, but I am sending this mail again in 
case it didn't arrive:



Zdenek Kabelac wrote on 16-05-2016 16:09:

> Behavior should be there for quite a while, but relatively recent fixes
> in dmeventd have made it work more reliably in more circumstances.
> I'd recommend to play at least with 142 - but since recent releases are
> bugfix oriented - if you are compiling yourself - just take latest.

Thanks Zdenek. You know I've had an interest in more thin safety, and 
though I may have been an ass sometimes, and though I was told that the 
perfect admin never runs into trouble ;-), I'm still concerned with 
actual practical measures :p.

I don't use my thin volumes for the system. That is difficult anyway 
because Grub doesn't allow it (although I may start writing for that at 
some point). But for me, a frozen volume would be vastly superior to the 
system locking up.

So while I was writing all of that ..material, I didn't realize that in 
my current system's state, the thing would actually cause the entire 
system to freeze. Not directly, but within a minute or so everything 
came to a halt. When I rebooted, all of the volumes were filled 100%, 
that is to say, all of the thin capacities added up to a 100% for the 
thin pool, and the pool itself was at 100%.

I didn't check the condition of the filesystem. You would assume it 
would contain partially written files.

If there is anything that would actually freeze the volume but not bring 
the system down, I would be most happy. But possibly it's the (ext) 
filesystem driver that makes trouble? Like we said, if there is no way 
to communicate space-fullness, what is it going to do right?

So is that dmeventd supposed to do anything to prevent disaster? Would I 
need to write my own plugin/configuration for it?

It is not running on my system currently. Without further amendments of 
course the only thing it could possibly do is to remount a filesystem 
read-only, like others have indicated it possibly already could.

Maybe it would even be possible to have a kernel module that blocks a 
certain kind of writes, but these things are hard, because the kernel 
doesn't have a lot of places to hook onto by design. You could simply 
give the filesystem (or actually the code calling for a write) write 
failures back.

All of that code is not filesystem dependent, in the sense that you can 
simply capture those writes in the VFS system, and not pass them on. At 
the cost of some extra function calls. But then you would need that 
module to know that certain volumes are read-only, and others aren't. 
All in all not very hard to do, if you know how to do the concurrency. 
In that case you could have a dmeventd plugin that would set this state, 
and possibly a user tool that would unset it. The state is set for all 
of the volumes of a thin pool, so the user tool would only need to unset 
this for the thin pool, not the volumes. In practice, in the beginning, 
this would be all you would need.

So I am just currently wondering about what other people have said: 
that the system already does this (mounting read-only).

I believe my test system just failed because the writes only took a few 
seconds to fill up the volume. Not a very good test, sorry. I didn't 
realize that, that it would check only in intervals.

I still wonder what freezes my system like that.

Regards, B.



And I'm sorry for any .... disturbance I may have caused here. Regards.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexptected filesytem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 14:09     ` Zdenek Kabelac
@ 2016-05-16 19:25       ` Xen
  2016-05-16 21:39         ` Xen
  2016-05-17  9:43         ` Zdenek Kabelac
  0 siblings, 2 replies; 23+ messages in thread
From: Xen @ 2016-05-16 19:25 UTC (permalink / raw)
  To: LVM general discussion and development

Zdenek Kabelac wrote on 16-05-2016 16:09:

> Behavior should be there for quite a while, but relatively recent fixes
> in dmeventd have made it work more reliably in more circumstances.
> I'd recommend to play at least with 142 - but since recent releases are
> bugfix oriented - if you are compiling yourself - just take latest.

Thanks Zdenek. You know I've had an interest in more thin safety, and 
though I may have been an ass sometimes, and though I was told that the 
perfect admin never runs into trouble ;-), I'm still concerned with 
actual practical measures :p.

I don't use my thin volumes for the system. That is difficult anyway 
because Grub doesn't allow it (although I may start writing for that at 
some point). But for me, a frozen volume would be vastly superior to the 
system locking up.

So while I was writing all of that ..material, I didn't realize that in 
my current system's state, the thing would actually cause the entire 
system to freeze. Not directly, but within a minute or so everything 
came to a halt. When I rebooted, all of the volumes were filled 100%, 
that is to say, all of the thin capacities added up to a 100% for the 
thin pool, and the pool itself was at 100%.

I didn't check the condition of the filesystem. You would assume it 
would contain partially written files.

If there is anything that would actually freeze the volume but not 
bring the system down, I would be most happy. But possibly it's the 
(ext) filesystem driver that makes trouble? Like we said, if there is 
no way to communicate space-fullness to it, what is it going to do, 
right?

So is dmeventd supposed to do anything to prevent disaster? Would I 
need to write my own plugin/configuration for it?

It is not running on my system currently. Without further amendments, 
of course, the only thing it could possibly do is remount a filesystem 
read-only, like others have indicated it may already do.
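
(If I understand the docs correctly, enabling it would be a matter of 
something like the following in lvm.conf - an untested sketch, and the 
library name is just the default I found documented:

   activation {
       monitoring = 1    # let dmeventd watch thin pools
   }
   dmeventd {
       thin_library = "libdevmapper-event-lvm2thin.so"
   }
)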

Maybe it would even be possible to have a kernel module that blocks 
certain kinds of writes, but these things are hard, because by design 
the kernel doesn't have a lot of places to hook onto. You could simply 
hand write failures back to the filesystem (or actually to the code 
calling for the write).

None of that code is filesystem dependent, in the sense that you could 
simply capture those writes in the VFS layer and not pass them on, at 
the cost of some extra function calls. But then that module would need 
to know which volumes are read-only and which aren't. All in all not 
very hard to do, if you know how to handle the concurrency. In that 
case you could have a dmeventd plugin that sets this state, and 
possibly a user tool that unsets it. The state would be set for all of 
the volumes of a thin pool, so the user tool would only need to unset 
it for the pool, not the individual volumes. In practice, in the 
beginning, this would be all you would need.
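
(The closest existing knob I am aware of seems to be the per-pool 
'error when full' flag - as I understand it, once the pool is full it 
makes writes that need a new allocation fail immediately instead of 
being queued, while already-provisioned blocks stay writable. Untested 
on my side; the names are from the man pages:

   lvchange --errorwhenfull y vg/pool
   lvs -o name,lv_when_full vg    # should report 'error' or 'queue'

It only acts at 100% fullness though, not at a configurable threshold, 
so it is not quite the do-not-allocate state described above.)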

So I am currently just wondering about what other people have said: 
that the system already does this (mounting read-only).

I believe my test just failed because the writes took only a few 
seconds to fill up the volume. Not a very good test, sorry. I didn't 
realize that it would check only at intervals.

I still wonder what freezes my system like that.

Regards, B.



And I'm sorry for any... disturbance I may have caused here. Regards.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 13:01   ` Xen
@ 2016-05-16 14:09     ` Zdenek Kabelac
  2016-05-16 19:25       ` Xen
  0 siblings, 1 reply; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-16 14:09 UTC (permalink / raw)
  To: LVM general discussion and development

On 16.5.2016 15:01, Xen wrote:
> Zdenek Kabelac schreef op 16-05-2016 14:08:
>
>> Hi
>>
>> Well yeah - ATM we rather take 'early' action and try to stop any
>> user of an overfilled thin-pool.
>
> May I inquire into whether this is only available in newer versions atm?
>
> These are my versions (lvs --version):
>
>    LVM version:     2.02.133(2) (2015-10-30)
>    Library version: 1.02.110 (2015-10-30)
>    Driver version:  4.34.0
>
> I have noticed that on my current system the entire system will basically
> freeze on overfill.
>
> I wonder if I should take measures to upgrade to something newer if I want to
> prevent this.

The behavior has been there for quite a while, but relatively recent
fixes in dmeventd have made it work more reliably in more circumstances.
I'd recommend playing with at least 142 - but since recent releases are
bugfix oriented - if you are compiling yourself - just take the latest.
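
You can also verify that dmeventd is actually watching your pool via
the reporting fields - for example (field name from memory, check
'lvs -o help'):

  lvs -o name,seg_monitor vg    # should say 'monitored' for the pool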


Regards

Zdenek

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-16 12:08 ` Zdenek Kabelac
@ 2016-05-16 13:01   ` Xen
  2016-05-16 14:09     ` Zdenek Kabelac
  2016-05-17 13:09   ` Gionatan Danti
  1 sibling, 1 reply; 23+ messages in thread
From: Xen @ 2016-05-16 13:01 UTC (permalink / raw)
  To: LVM general discussion and development

Zdenek Kabelac wrote on 16-05-2016 14:08:

> Hi
> 
> Well yeah - ATM we rather take 'early' action and try to stop any
> user of an overfilled thin-pool.

May I inquire into whether this is only available in newer versions atm?

These are my versions (lvs --version):

   LVM version:     2.02.133(2) (2015-10-30)
   Library version: 1.02.110 (2015-10-30)
   Driver version:  4.34.0

I have noticed that on my current system the entire system will 
basically freeze on overfill.

I wonder if I should take measures to upgrade to something newer if I 
want to prevent this.

Regards,

X.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
  2016-05-15 10:33 Gionatan Danti
@ 2016-05-16 12:08 ` Zdenek Kabelac
  2016-05-16 13:01   ` Xen
  2016-05-17 13:09   ` Gionatan Danti
  0 siblings, 2 replies; 23+ messages in thread
From: Zdenek Kabelac @ 2016-05-16 12:08 UTC (permalink / raw)
  To: LVM general discussion and development, g.danti

On 15.5.2016 12:33, Gionatan Danti wrote:
> Hi list,
> I had an unexpected filesystem unmount on a machine where I am using thin
> provisioning.
>

Hi

Well yeah - ATM we rather take 'early' action and try to stop any
user of an overfilled thin-pool.



> It is a CentOS 7.2 box (kernel 3.10.0-327.3.1.el7, lvm2-2.02.130-5.el7_2.1),
> with the following volume situation:
> # lvs -a
>   LV                   VG         Attr       LSize  Pool         Origin Data%  Meta%  Move Log Cpy%Sync Convert
>   000-ThinPool         vg_storage twi-aotz-- 10.85t                     74.06  33.36
>   [000-ThinPool_tdata] vg_storage Twi-ao---- 10.85t
>   [000-ThinPool_tmeta] vg_storage ewi-ao---- 88.00m
>   Storage              vg_storage Vwi-aotz-- 10.80t 000-ThinPool        74.40
>   [lvol0_pmspare]      vg_storage ewi------- 88.00m
>   root                 vg_system  -wi-ao---- 55.70g
>   swap                 vg_system  -wi-ao----  7.81g
>
> As you can see, thin pool/volume is at about 75%.
>
> Today I found the Storage volume unmounted, with the following entries in
> /var/log/messages:
> May 15 09:02:53 storage lvm[43289]: Request to lookup VG vg_storage in lvmetad gave response Connection reset by peer.
> May 15 09:02:53 storage lvm[43289]: Volume group "vg_storage" not found
> May 15 09:02:53 storage lvm[43289]: Failed to extend thin vg_storage-000--ThinPool-tpool.
> May 15 09:02:53 storage lvm[43289]: Unmounting thin volume vg_storage-000--ThinPool-tpool from /opt/storage.

Basically, whenever 'lvresize' fails, the dmeventd plugin now tries
to unconditionally umount any thin volume associated with a
thin-pool above the threshold.


> What puzzles me is that both thin_pool_autoextend_threshold and
> snap_pool_autoextend_threshold are disabled in the lvm.conf file
> (thin_pool_autoextend_threshold = 100 and snap_pool_autoextend_threshold =
> 100). Moreover, no custom profile/policy is attached to the thin pool/volume.

For now, the plugin calls the tool 'lvresize --use-policies'.
If this tool FAILs for ANY reason -> umount will happen.

I'll probably put in an 'extra' test so that the umount happens
only at >=95% fullness.

dmeventd itself has no idea whether 100 or less is configured - it's
lvresize that sees it - so even if you set 100%, and you have
monitoring enabled, you will get the umount (but no resize)
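
For illustration, a policy that actually resizes would look something
like this in lvm.conf (example values only, not a recommendation):

  activation {
      thin_pool_autoextend_threshold = 80   # act when pool hits 80%
      thin_pool_autoextend_percent = 20     # grow pool by 20% each time
  }

With the threshold at 100 there is nothing for lvresize to do - so, as
said above, you get the umount but no resize.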


>
> To me, it seems that lvmetad crashed/had some problems and the system,
> being "blind" about the thin volume utilization, put it offline. But I cannot
> understand the "Failed to extend thin vg_storage-000--ThinPool-tpool", and I
> had *no* autoextend in place.

If you strictly don't care about any tracking of thin-pool fullness,
disable monitoring in lvm.conf.
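
I.e.:

  activation {
      monitoring = 0    # dmeventd will not watch (or umount) anything
  }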


>
> I rebooted the system and the Storage volume is now mounted without problems.
> I also tried to write about 16 GB of raw data to it, and I had no problems.
> However, I cannot understand why it was put offline in the first place. As a
> last piece of information, I noted that the kernel & lvm were auto-updated two
> days ago. Maybe it is related?
>
> Can you give me some hint of what happened, and how to avoid it in the future?

Well, 'lvmetad' shall not crash. ATM such a crash may kill commands - and
further stop processing - as we rather 'stop' further usage than allow
it to cause bigger damage.

So if you have an unusual system/device setup causing the 'lvmetad' crash -
open a BZ, and meanwhile set 'use_lvmetad=0' in your lvm.conf till the bug
is fixed.
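
I.e.:

  global {
      use_lvmetad = 0   # fall back to device scanning, bypass the daemon
  }

(and stop the lvmetad socket/service as well - on CentOS 7 that should
be lvm2-lvmetad, if I remember the unit name correctly)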

Regards

Zdenek

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed?
@ 2016-05-15 10:33 Gionatan Danti
  2016-05-16 12:08 ` Zdenek Kabelac
  0 siblings, 1 reply; 23+ messages in thread
From: Gionatan Danti @ 2016-05-15 10:33 UTC (permalink / raw)
  To: linux-lvm

Hi list,
I had an unexpected filesystem unmount on a machine where I am using 
thin provisioning.

It is a CentOS 7.2 box (kernel 3.10.0-327.3.1.el7, 
lvm2-2.02.130-5.el7_2.1), with the following volume situation:
# lvs -a
   LV                   VG         Attr       LSize  Pool         Origin Data%  Meta%  Move Log Cpy%Sync Convert
   000-ThinPool         vg_storage twi-aotz-- 10.85t                     74.06  33.36
   [000-ThinPool_tdata] vg_storage Twi-ao---- 10.85t
   [000-ThinPool_tmeta] vg_storage ewi-ao---- 88.00m
   Storage              vg_storage Vwi-aotz-- 10.80t 000-ThinPool        74.40
   [lvol0_pmspare]      vg_storage ewi------- 88.00m
   root                 vg_system  -wi-ao---- 55.70g
   swap                 vg_system  -wi-ao----  7.81g

As you can see, thin pool/volume is at about 75%.

Today I found the Storage volume unmounted, with the following entries 
in /var/log/messages:
May 15 09:02:53 storage lvm[43289]: Request to lookup VG vg_storage in lvmetad gave response Connection reset by peer.
May 15 09:02:53 storage lvm[43289]: Volume group "vg_storage" not found
May 15 09:02:53 storage lvm[43289]: Failed to extend thin vg_storage-000--ThinPool-tpool.
May 15 09:02:53 storage lvm[43289]: Unmounting thin volume vg_storage-000--ThinPool-tpool from /opt/storage.
...

The lines above repeated every 10 seconds.

What puzzles me is that both thin_pool_autoextend_threshold and 
snap_pool_autoextend_threshold are disabled in the lvm.conf file 
(thin_pool_autoextend_threshold = 100 and snap_pool_autoextend_threshold 
= 100). Moreover, no custom profile/policy is attached to the thin 
pool/volume.

To me, it seems that lvmetad crashed/had some problems and the 
system, being "blind" about the thin volume utilization, put it offline. 
But I cannot understand the "Failed to extend thin 
vg_storage-000--ThinPool-tpool", and I had *no* autoextend in place.

I rebooted the system and the Storage volume is now mounted without 
problems. I also tried to write about 16 GB of raw data to it, and I 
had no problems. However, I cannot understand why it was put offline in 
the first place. As a last piece of information, I noted that the kernel 
& lvm were auto-updated two days ago. Maybe it is related?

Can you give me some hint of what happened, and how to avoid it in the 
future?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-05-24 17:17 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1872684910.4114972.1463547443287.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-18  4:57 ` [linux-lvm] Unexpected filesystem unmount with thin provision and autoextend disabled - lvmetad crashed? matthew patton
2016-05-18 14:20   ` Xen
     [not found] <766997897.3926921.1463545271031.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-18  4:21 ` matthew patton
2016-05-15 10:33 Gionatan Danti
2016-05-16 12:08 ` Zdenek Kabelac
2016-05-16 13:01   ` Xen
2016-05-16 14:09     ` Zdenek Kabelac
2016-05-16 19:25       ` Xen
2016-05-16 21:39         ` Xen
2016-05-17  9:43         ` Zdenek Kabelac
2016-05-17 17:17           ` Xen
2016-05-17 19:18             ` Zdenek Kabelac
2016-05-17 20:43               ` Xen
2016-05-17 22:26                 ` Zdenek Kabelac
2016-05-18  1:34                   ` Xen
2016-05-18 12:15                     ` Zdenek Kabelac
2016-05-17 13:09   ` Gionatan Danti
2016-05-17 13:48     ` Zdenek Kabelac
2016-05-18 13:47       ` Gionatan Danti
2016-05-24 13:45         ` Gionatan Danti
2016-05-24 14:17           ` Zdenek Kabelac
2016-05-24 14:28             ` Gionatan Danti
2016-05-24 17:17               ` Zdenek Kabelac
