* [linux-lvm] about the lying nature of thin @ 2016-04-28 22:37 Xen 2016-04-29 8:44 ` Marek Podmaka 2016-05-10 21:47 ` [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) Linda A. Walsh 0 siblings, 2 replies; 16+ messages in thread From: Xen @ 2016-04-28 22:37 UTC (permalink / raw) To: Linux lvm You know, Mr. Patton made the interesting allusion that thin provisioning is designed to lie and is meant to lie, and I beg to differ. Under normal operating conditions any thin volume should be allowed to grow to its maximum V-size, provided not everyone is doing that at the same time. Nowhere in the thin contract is there something that says "this space I have available to you, you don't have to share it". That is like saying the basement container room used as bike and motor space in my apartment complex is a lie, because if I were to fill it up, other people couldn't use it anymore. The visuals clearly indicate available physical space, but I know that if I use it, others won't be able to. It's called sharing. In practical matters a thin volume only starts to lie when "real space" < "virtual space" -- a condition you are normally trying to avoid. So I would not even say that by definition a thin volume or thin volume manager lies. It only starts "lying" the moment real available space goes below virtual available space, something you would normally be trying to avoid. Since your guarantee to your customers (for instance) is that this space IS going to be available, you're actually lying to them by not informing them of the condition that this guarantee cannot actually be met at some point in time. Thin pools do not lie by default. They lie when they cannot fulfill their obligations, and this is precisely the reason for the idea I suggested: to stop the lie, to be honest. 
It was said (by Marek Podmaka) that you don't want customers / users to know about the reality behind the thin pool, in some or many use cases (liberally interpreted). That there are use cases where you don't want the client to know about the thin nature. But if you don't do your job right and the thin pool does start to fill up, that starts to sound like lying to your client and saying "everything is all right" while behind the scenes everyone is in calamity mode. "Is something wrong? No no, not at all". You're usually aware that you're being lied to ;-) if you are talking to a real human. So basically: * either you do your job right and nothing is the matter * you don't do your job right but you don't tell anyone * you don't do your job right and you own up. Saying that thin pools habitually lie is not right. The question is not what happens or what you do while the system is functioning as intended. The question is what you do when that is no longer the case: * do you inform the guest system? * do you keep silent until shit breaks loose? IF you had an autoextend mechanism present you could equally well decide not to "inform" clients as long as that was the case. After all, if you have automatic extending configured and it is operational, then the "real size" is actually larger than what you currently have. In that case "real size < virtual size" does not hold or does not happen, and there is no need to communicate anything. This is also a question about ethics, perhaps. Personally I like to be informed. I don't know what you do or want. But I can think of any number of analogies or life situations where I would definitely choose to be informed instead of being lied to. Thin LVM does not lie by default. It may only start to lie when conditions are no longer met. Regards, Xen. ^ permalink raw reply [flat|nested] 16+ messages in thread
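The autoextend mechanism referred to above is an existing lvm.conf facility; a minimal sketch of the relevant knobs (values are illustrative only, not recommendations):

```
# /etc/lvm/lvm.conf
activation {
    # When a thin pool crosses 70% full...
    thin_pool_autoextend_threshold = 70
    # ...grow it by 20% of its current size,
    # provided the VG still has free extents.
    thin_pool_autoextend_percent = 20
}
```

With this in place (and dmeventd monitoring active), the "real size" behind a thin pool grows before the lying condition can arise, which is exactly the situation Xen describes where nothing needs to be communicated.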
* Re: [linux-lvm] about the lying nature of thin 2016-04-28 22:37 [linux-lvm] about the lying nature of thin Xen @ 2016-04-29 8:44 ` Marek Podmaka 2016-04-29 10:06 ` Gionatan Danti ` (2 more replies) 2016-05-10 21:47 ` [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) Linda A. Walsh 1 sibling, 3 replies; 16+ messages in thread From: Marek Podmaka @ 2016-04-29 8:44 UTC (permalink / raw) To: Xen; +Cc: Linux lvm Hello Xen, Friday, April 29, 2016, 0:37:23, you wrote: > In practical matters a thin volume only starts to lie when "real space" > < "virtual space" -- a condition you are normally trying to avoid. > Thin pools do not lie by default. They lie when they cannot fulfill > their obligations, and this is precisely the reason for the idea I > suggested: to stop the lie, to be honest. I would say that thin provisioning is designed to lie about the available space. This is what it was invented for. As long as the used space (not virtual space) is not greater than real space, everything is ok. Your analogy with customers still applies, and the whole IT business is based on it (over-provisioning home internet connection speed, "guaranteed" webhosting disk space). It seems to me that disk space was the last thing to get over- (or thin-) provisioned :) Now I'm not sure what your use-case for thin pools is. I don't see it much useful if the presented space is smaller than available physical space. In that case I can just use plain LVM with PV/VG/LV. For snapshots you don't care much, as if the snapshot overfills it just becomes invalid, but won't influence the original LV. But their use case is to simplify the complexity of adding storage. Traditionally you need to add new physical disks to the storage / server, add it to LVM as a new PV, add this PV to a VG, extend the LV and finally extend the filesystem. Usually the storage part and the server (LVM) part are done by different people / teams. 
By using thinp, you create a big enough VG, LV and filesystem. Then as it is needed you just add physical disks and you're done. Another benefit is disk space saving. Traditionally you need to have some reserve as free space in each filesystem for growth. With many filesystems you just wasted a lot of space. With thinp, this free space is "shared". And regarding your other mail about presenting parts / chunks of blocks from the block layer... This is what device mapper (and LVM built on top of it) does - it takes many parts of many block devices and creates a new linear block device out of them (whether it is a striped LV, mirrored LV, dm-crypt or just a concatenation of 2 disks). -- bYE, Marki
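The two workflows Marek contrasts can be sketched with stock LVM commands (device, VG and LV names here are hypothetical):

```
# Traditional flow: every growth step touches the filesystem too.
pvcreate /dev/sdc
vgextend vg0 /dev/sdc
lvextend -L +100G vg0/srv
resize2fs /dev/vg0/srv

# Thinp flow: the filesystem was made "big enough" up front on a thin
# volume, so adding capacity is purely a storage-side operation --
# extend the pool and you are done, no filesystem resize needed.
pvcreate /dev/sdd
vgextend vg0 /dev/sdd
lvextend -L +100G vg0/thinpool
```

This is the division-of-labour point: the second flow never needs the "server side" team at all.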
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 8:44 ` Marek Podmaka @ 2016-04-29 10:06 ` Gionatan Danti 2016-04-29 13:16 ` Xen 2016-04-29 11:53 ` Xen 2016-04-29 20:37 ` Chris Friesen 2 siblings, 1 reply; 16+ messages in thread From: Gionatan Danti @ 2016-04-29 10:06 UTC (permalink / raw) To: Marek Podmaka, LVM general discussion and development, Xen On 29/04/2016 10:44, Marek Podmaka wrote: > Hello Xen, > Now I'm not sure what your use-case for thin pools is. > > I don't see it much useful if the presented space is smaller than > available physical space. In that case I can just use plain LVM with > PV/VG/LV. For snaphosts you don't care much as if the snapshot > overfills, it just becomes invalid, but won't influence the original > LV. > Let me add one important use case: having fast, flexible snapshots. In the past I used classic LVM to build our virtualization servers, but this meant I was basically forced to use a separate volume for each VM: using a single big volume and filesystem for all the VMs means that, while snapshotting it for backup purposes, I/O becomes VERY slow on ALL virtual machines. On the other hand, thin pools provide much faster snapshots. On the latest builds, I have begun using a single large thin volume, on top of a single large thin pool, to host a single filesystem that can be snapshotted with no big slowdown on the I/O side. I understand that it is a tradeoff - classic LVM mostly provides contiguous blocks, so fragmentation remains quite low, while thin pools/volumes are much more prone to fragment, but with large enough chunks it is not such a big problem. Regards. -- Danti Gionatan Supporto Tecnico Assyoma S.r.l. - www.assyoma.it email: g.danti@assyoma.it - info@assyoma.it GPG public key ID: FF5F32A8
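The single-pool, single-volume layout Gionatan describes can be sketched roughly as follows (all names hypothetical; a sketch, not a tested recipe):

```
# One pool, one big thin volume, one filesystem for all VMs.
lvcreate -L 500G -T vg0/pool0
lvcreate -V 2T -T vg0/pool0 -n vmstore
mkfs.xfs /dev/vg0/vmstore

# Snapshot for backup: copy-on-write per chunk, so creation is nearly
# instant and unmodified chunks stay shared with the origin -- unlike a
# classic LVM snapshot, which copies every chunk written to the origin.
lvcreate -s -n vmstore_backup vg0/vmstore
# Thin snapshots carry the activation-skip flag by default;
# activate with: lvchange -ay -K vg0/vmstore_backup
```

The chunk-size tradeoff he mentions is set at pool creation time (the lvcreate -c/--chunksize option).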
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 10:06 ` Gionatan Danti @ 2016-04-29 13:16 ` Xen 2016-04-29 22:32 ` Xen 0 siblings, 1 reply; 16+ messages in thread From: Xen @ 2016-04-29 13:16 UTC (permalink / raw) To: LVM general discussion and development Gionatan Danti wrote on 29-04-2016 12:06: > Let me add one important use case: have fast, flexible snapshots. One more huge reason for using it in a desktop system. I didn't know about the performance benefits. I just know that providing snapshot space in advance by registering LVs for that purpose in advance is not a good way of working (for me, or anyone). Although the idea of using LVM thin to provide only a single thin volume might be rather odd ;-). Still, the snapshotting is clearly superior to that of traditional LVM, right? Regards.
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 13:16 ` Xen @ 2016-04-29 22:32 ` Xen 2016-04-30 4:46 ` Mark Mielke 0 siblings, 1 reply; 16+ messages in thread From: Xen @ 2016-04-29 22:32 UTC (permalink / raw) To: LVM general discussion and development I guess this Patton guy knows everything about everything, but I'm not responding to him anymore. As he sets up his business empire he leaves us all in the dust anyway. So I guess I will just keep it to the thing I know something about, which is talking to real people.
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 22:32 ` Xen @ 2016-04-30 4:46 ` Mark Mielke 2016-05-03 13:03 ` Xen 0 siblings, 1 reply; 16+ messages in thread From: Mark Mielke @ 2016-04-30 4:46 UTC (permalink / raw) To: LVM general discussion and development Lots of interesting ideas in this thread. But the practical reality is that there is a need for thin volumes that are over-provisioned. Call it a lie if you must, but I want to have multiple snapshots, and not be forced to have 10X the storage, just so that I can *guarantee* that I will have the technical capability to fully allocate every snapshot without running out of space. This is for my requirements, where I am not being naive or irresponsible. I'm not misrepresenting the situation to myself. I know exactly what to expect, and I know that it isn't only important to monitor, but it is also important to understand the usage patterns. For example, in some of our use cases, files will only normally be extended or created as new, at which point the overhead of a snapshot is close to zero. If people find this model unacceptable, then I think they should not use thin volumes. It's a technology choice. We have many systems like this beyond LVM... For example, the NetApp FAS devices we have are set up with this type of model, and IT normally allocates 10% or more for "snapshots", and when we get this wrong, it does hurt in various ways, usually requiring that the snapshots get dumped, and that we figure out why the monitoring failed. Normally, IT adds to the aggregate as it passes a threshold. In the particular case that is important for me - we have a fixed size local SSD for maximum performance, and we still want to take frequent snapshots (and prune them behind), similar to what we do on NetApp, but all in the context of local storage. I don't use the word "lie" to IT in these cases. 
It's a partnership, and an attempt to make the most use of the storage and the technology. There was some discussion about how data is presented to the higher layers. I didn't follow the suggestion exactly (communicating layout information?), but I did have these thoughts: 1. When the storage runs out, it clearly communicates layout information to the caller in the form of a boolean "does it work or not?" 2. There are other ways that information does get communicated, such as if a device becomes read only. For example, an iSCSI LUN. I didn't follow the communication of specific layout information, as this didn't really make sense to me when it comes to dynamic allocation. But, if the intent is to provide early warning of the likelihood of failure, compared to waiting to the very last minute when it has already failed, it seems like early warning would be useful. I did have a question about the performance of this type of communication, however, as I wouldn't want the host to be constantly polling the storage to recalculate the up-to-date storage space available. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] about the lying nature of thin 2016-04-30 4:46 ` Mark Mielke @ 2016-05-03 13:03 ` Xen 0 siblings, 0 replies; 16+ messages in thread From: Xen @ 2016-05-03 13:03 UTC (permalink / raw) To: LVM general discussion and development Mark Mielke wrote on 30-04-2016 6:46: > Lots of interesting ideas in this thread. Thank you for your sane response. > There was some discussion about how data is presented to the higher > layers. I didn't follow the suggestion exactly (communicating layout > information?), but I did have these thoughts: > > * When the storage runs out, it clearly communicates layout > information to the caller in the form of a boolean "does it work or > not?" > * There are other ways that information does get communicated, such > as if a device becomes read only. For example, an iSCSI LUN. > > I didn't follow communication of specific layout information as this > didn't really make sense to me when it comes to dynamic allocation. > But, if the intent is to provide early warning of the likelihood of > failure, compared to waiting to the very last minute where it has > already failed, it seems like early warning would be useful. I did > have a question about the performance of this type of communication, > however, as I wouldn't want the host to be constantly polling the > storage to recalculate the up-to-date storage space available. Zdenek alluded to the idea and fact that this continuous polling would either be required or be deeply ungraceful to the hardware. In the sense of being hugely expensive. Of course I do not know everything about a system before I start thinking. If I have an idea it is usually possible to implement it, but I only find out later down the road if this is actually so and if it needs amending. I could not progress with life if every idea needed to be 100% sure before I could commence with it, because in that sense the commencing and the learning would never happen. 
I didn't know thin (or LVM) doesn't maintain maps of used blocks. Of course for regular LVM it makes no sense if the usage of the blocks you have allocated to a system is none of your concern at all. The recent DISCARD improvements apparently just signal some special case (?) but SSDs DO maintain maps or it wouldn't even work (?). I don't know, it would seem that having a map of used extents in a thin pool is in some way deeply important in being able to allocate unused ones? I would have to dig into it of course but I am sure I would be able to find some information (and not lies ;-))). I guess continuous polling would be deeply disrespectful of the hardware and software resources. In the theoretical system I proposed it would be a constant communication between systems bogging down resources. But we must agree we are typically talking about 4MB blocks here (and mutations to them). In a sense you could easily increase that to 16MB, or 32MB, or whatever. You could even update a filesystem when mutations of a thousand gigabytes have happened. We are talking about a map of regions and these regions can be as large as you want. It would say to a filesystem: these regions are currently unavailable. You would even get more flags: - this region is entirely unavailable - this region is now more expensive to allocate to - this region is the preferred place When you allocate memory in the kernel (like with kmalloc) you specify what kind of requirements you have. This is more of the same kind, I guess. Typically a thin system is a system of extent allocation, the way we have it. It is the thin volume that allocates this space, but the filesystem that causes it. The thin volume would be able to say "don't use these parts". Or "all parts are equal, but don't use more than X currently". Actually the latter is a false statement, you need real information. 
I know in ext filesystems the inodes are scattered everywhere (and the tables) so the blocks are already getting used, in that sense. And if you had very large blocks that you would want to make totally unavailable, you would get weird issues. "That's funny, I'm already using it". So in order to make sense they would have to be contiguous regions (in the virtual space) that are really not used yet. I don't know, it seems fun to make something like that. Maybe I'll do it some day. ^ permalink raw reply [flat|nested] 16+ messages in thread
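The region-map idea sketched in this message could be prototyped roughly as follows. Nothing like this exists in LVM or any filesystem today; every name and flag here is invented purely to illustrate the coarse-grained "these regions are currently unavailable / expensive / preferred" communication Xen proposes:

```python
from enum import Enum

class Hint(Enum):
    UNAVAILABLE = 0   # region has no backing extents at all right now
    EXPENSIVE = 1     # allocating here would force the pool to extend
    PREFERRED = 2     # fully backed; allocate here first

class RegionMap:
    """Maps coarse virtual regions (e.g. 1 GiB each) to allocation hints.

    The point of coarse regions is exactly the one made above: updates
    happen on region-status changes, not on every 4 MB block mutation,
    so there is no constant polling between the layers.
    """
    def __init__(self, regions, region_size):
        self.region_size = region_size
        self.hints = [Hint.PREFERRED] * regions

    def publish(self, index, hint):
        # The thin layer pushes an update only when a region's status
        # changes -- the filesystem never polls.
        self.hints[index] = hint

    def usable_bytes(self):
        return sum(self.region_size for h in self.hints
                   if h is not Hint.UNAVAILABLE)

# A pool backing 8 virtual regions of 1 GiB, with the tail running short:
m = RegionMap(regions=8, region_size=1 << 30)
m.publish(7, Hint.UNAVAILABLE)
m.publish(6, Hint.EXPENSIVE)
```

Note the caveat from the message still applies: this only works if the flagged regions are contiguous virtual ranges the filesystem is genuinely not using yet.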
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 8:44 ` Marek Podmaka 2016-04-29 10:06 ` Gionatan Danti @ 2016-04-29 11:53 ` Xen 2016-04-29 20:37 ` Chris Friesen 2 siblings, 0 replies; 16+ messages in thread From: Xen @ 2016-04-29 11:53 UTC (permalink / raw) To: Linux lvm Marek Podmaka schreef op 29-04-2016 10:44: > I would say that thin provisioning is designed to lie about the > available space. This is what it was invented for. As long as the used > space (not virtual space) is not greater then real space, everything > is ok. Your analogy with customers still applies and whole IT business > is based on it (over-provisioning home internet connection speed, > "guaranteed" webhosting disk space). It seems to me that disk space > was the last thing to get over- (or thin-) provisioned :) But you see if my landlord tells me I can use the entire container room, except that I have to share it with others, does he lie? I *can* use the entire container room. I just have to ensure it is empty again by the end of the day (or even sooner). Those ISPs do not say "Every client can use the full bandwidth all at the same time." They don't say that. They say "Fair use policies apply". That's what they say. And they mean that no, you can't do that stuff 24/7/365. So let's talk then about two things you can lie about: * available space * the thought that all of the space is available to everyone at all times. In a normal use case, only the latter would be a lie. But that's not what companies tell their clients. Maybe implicitly, at times. But not explicitly at all (hence fair use policy). The former is not a lie. If you have a 1000 customers, and each has 50GB available total, and the average use at this point is 25GB, and you have provisioned for ~35GB each, meaning 35000 GB is available and 25000 is in use, then it is not a lie to say to any individual customer: you can use 50GB if you want. 
The guarantee that everyone can do it all at the same time just doesn't hold, but that is never communicated. As a customer you are not aware of how many other clients there are, or how many other thin volumes (ordinarily), or what the max capacity is across all the volumes. So you are not being lied to. For it to be a lie, you would have to be concerned about the total picture. You would have to have an awareness of other clients, and then you would need to make the assumption that all of these clients at the same time can use all of that bandwidth/data/space. But your personal scenario doesn't extend that far. Just as a funny example. Nearby there was a supermarket that advertised with that (to my mind) stupid thought "if there are more than 4 customers in line, and you are the 5th, you get your groceries for free". What did a local student's house do? They went to the supermarket with about 20 people and got a lot of stuff free. I mean in statistics you have queue calculations too, but they get defeated if people start doing that stuff (thwarting the mechanism on purpose). For example, the traditional statistics example is that of customers at a hairsalon. Based on a certain distribution and an average number of new arrivals, a conclusion is reached and certain data is found. But this data is thwarted the moment customers start to pile up on purpose just to thwart this data, you get what I mean? Any /intentional/ attempt to thwart the average means it is no longer the average. Normal people wanting a haircut do not show up at a salon to thwart the salon's calculations. Ordinary use cases do not apply to this. If you can expect a common, normal amount of use, then there is no "intent" with those clients to be doing anything out of the ordinary. 
Just like that "hairsalon" can normally depend on those "calculations" (you could, you know) and provision for that (number of employees present) so too can a thin provisioning setup depend on expected averages (in a distribution, the "expected" value of a stochast is the expected average) (as a prediction in that sense). There's no lying in that. If this hairsalon now says "You can get cut within 10 minutes without an appointment" then yes people could thwart that by suddenly all showing up at the same time. Doesn't work like that in reality when people do not have such intentions. We call that "innocence" ;-) not doing something on purpose. That hairsalon is not lying if it guarantees 10 minute wait time in general. It just cannot guarantee it if people start to bugger. Statistics is all about averages and large numbers. "A "law of large numbers" is one of several theorems expressing the idea that as the number of trials of a random process increases, the percentage difference between the expected and actual values goes to zero." That means that if you have enough numbers (enough thin volumes) the likelihood in actuality between what you promise and what you can deliver, the difference goes to zero and in effect you are always speaking the truth. Remember: you are speaking the truth given normal expected reality. You are no longer speaking the truth if people start to mess with you on purpose. If you have 10.000 clients and 5.000 of them are one person intending to bug you out, just like in the supermarket example, well, then you've lost. But, that is an intentional devious thing to do just in order to make use of some monetary loophole in the system, so to speak. And in general your terms of use could guard against that (and many companies do, I'm sure). > Now I'm not sure what your use-case for thin pools is. Presently maximizing space efficiency across a small number of volumes, as well as access to superior snapshotting ability. 
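The law-of-large-numbers argument can be made concrete with a back-of-envelope calculation, assuming (purely for illustration -- the thread never states a distribution) that each customer's usage is independent and uniform on [0, 50] GB, matching the 25 GB average and 35 GB-per-customer provisioning from the earlier message:

```python
import math

n = 1000                      # customers
mean = 25.0                   # GB, average use per customer
std = 50.0 / math.sqrt(12)    # std dev of uniform(0, 50): ~14.4 GB
pool = 35.0 * n               # provisioned: 35 GB per customer

# Normal approximation (central limit theorem) for the total demand
# of n independent customers:
z = (pool - n * mean) / (std * math.sqrt(n))
p_overflow = 0.5 * math.erfc(z / math.sqrt(2))
# z comes out around 22 standard deviations, so p_overflow is
# astronomically small: under "innocent" independent use the promise
# is effectively always kept -- until the 20-students trick makes
# the independence assumption false.
```

This is exactly the sense in which "the percentage difference between the expected and actual values goes to zero": the risk lives entirely in correlated, intentional behaviour, not in the overcommit itself.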
> I don't see it much useful if the presented space is smaller than > available physical space. In that case I can just use plain LVM with > PV/VG/LV. For snapshots you don't care much as if the snapshot > overfills, it just becomes invalid, but won't influence the original > LV. You mean there'd not be any use for thin, right. I agree. The whole idea is to be more efficient with space. If the presented space is smaller, then you HAVE room for those snapshots. But with thin, you don't need to care. Space is always there. > But their use case is to simplify the complexity of adding storage. > Traditionally you need to add new physical disks to the storage / > server, add it to LVM as new PV, add this PV to VG, extend LV and > finally extend filesystem. Usually the storage part and server (LVM) > part is done by different people / teams. By using thinp, you create > big enough VG, LV and filesystem. Then as it is needed you just add > physical disks and you're done. True, but let's call it "sharing" resources. Sharing resources is the whole idea of any advanced society. Our western mindset doesn't work in the sense of everyone needing to be able to possess everything. The example was given that everyone owns a car, that they may not use every day, a washing machine, that they may use 5 hours a week, a vacuum cleaner, that they may use 1 hour a week, and so on and so on. The example was given that a commercial airliner could *never* do something like that. Commercial airplanes are in operation pretty much 24/7. Disuse is way too costly. They cannot afford to not use their machines 24/7. Our society cannot either, but the way we live and operate with each other currently ensures vast amounts of wasted materials, energy and so on. Resource sharing is an advanced concept in that sense. Let's just call thin pools an advanced concept :p. And let's not call it a lie just like that :) :P. > Another benefit is disk space saving. 
Traditionally you need to have > some reserve as free space in each filesystem for growth. With many > filesystems you just wasted a lot of space. With thinp, this free > space is "shared". My reason exactly. > And regarding your other mail about presenting parts / chunks of > blocks from block layer... This is what device mapper (and LVM built > on top of it) does - it takes many parts of many block devices and > creates new linear block device out of them (whether it is stripped > LV, mirrored LV, dm-crypt or just concatenation of 2 disks). I know. But that is the reverse thing. DM/LVM takes dispersed stuff and presents a whole. In this case we were talking about presenting holes. That's because in this case ..... If you are that barber/haircutter and suddenly you get an influx of clients you cannot handle. Are you going to put up a sign saying "sorry, too busy" or are you going to try to keep your "promise" to each and every one of them? I hope you didn't offer financial compensation in that sense ;-). Personally I think that as a client you making use of such "financial promises" is very intolerant and unforgiving and greedy and even avaricious ;-). So what if your thin pool does fill up and you have no measure in place to handle it? Are you going to be honest? This question is not whether thin is currently lying. This is about whether you will continue to choose for it to lie. It is not about the present. It is about the choice you are going to make. Do you choose to lie or not? Traditionally companies have always tried to keep up the pretense until all hell broke loose so badly that it spilled out like a tidal wave. You can find any number of examples in the history of our world. I am currently thinking of the Exxon Valdez, and Enron. I don't know if that is applicable. Also thinking of that platform in recent times, of BP. Deepwater Horizon, which was said to have been deeply undermaintained. 
I mean you can keep pretending everything is going just perfect, or you can own up a little sooner. That is a choice to make for each individual I guess.
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 8:44 ` Marek Podmaka 2016-04-29 10:06 ` Gionatan Danti 2016-04-29 11:53 ` Xen @ 2016-04-29 20:37 ` Chris Friesen 2 siblings, 0 replies; 16+ messages in thread From: Chris Friesen @ 2016-04-29 20:37 UTC (permalink / raw) To: linux-lvm On 04/29/2016 03:44 AM, Marek Podmaka wrote: > Now I'm not sure what your use-case for thin pools is. > > I don't see it much useful if the presented space is smaller than > available physical space. In that case I can just use plain LVM with > PV/VG/LV. For snapshots you don't care much as if the snapshot > overfills, it just becomes invalid, but won't influence the original > LV. One useful case for "presented space equal to physical space" with thin volumes is that it simplifies security issues. With raw LVM volumes I generally need to zero out the whole volume prior to deleting it (to avoid leaking the contents to other users). This takes time, and also seriously hammers the disks when you have multiple volumes being zeroed in parallel. With thin, deletion is essentially instantaneous, and the zeroing penalty is paid when the disk block is actually written. Any disk blocks which have not been written are simply read as all-zeros. Chris
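Chris's point can be sketched with stock LVM commands (hypothetical names; a sketch, not a hardening guide):

```
# Classic LV: wipe before handing the blocks back to the VG, or the
# next tenant allocated those extents can read the old contents.
dd if=/dev/zero of=/dev/vg0/tenant1 bs=1M
lvremove vg0/tenant1

# Thin volume: unprovisioned chunks already read back as zeros, and
# pool zeroing (-Z y at creation, or thin_pool_zero in lvm.conf) makes
# LVM zero each chunk when it is first provisioned -- so lvremove
# needs no wipe pass at all.
lvcreate -L 500G -T vg0/pool0 -Z y
lvcreate -V 50G -T vg0/pool0 -n tenant1
```

The tradeoff is the one Chris names: the zeroing cost moves from deletion time to first-write time.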
* [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) 2016-04-28 22:37 [linux-lvm] about the lying nature of thin Xen 2016-04-29 8:44 ` Marek Podmaka @ 2016-05-10 21:47 ` Linda A. Walsh 2016-05-10 23:58 ` Xen 1 sibling, 1 reply; 16+ messages in thread From: Linda A. Walsh @ 2016-05-10 21:47 UTC (permalink / raw) To: LVM general discussion and development Xen wrote: > You know mr. Patton made the interesting allusion that thin > provisioning is designed to lie and is meant to lie, and I beg to differ. ---- Isn't using a thin memory pool for disk space similar to using a virtual memory/swap space that is smaller than the combined sizes of all processes? I.e. Administrators can choose to decide whether to over-allocate swap or paging file space or to have it be a hard limit -- and forgive me if I'm wrong, but isn't this a configurable in /proc/sys/vm with the over-commit parms (among others)? Doesn't over-commit in the LVM space have similar checks and balances as over-commit in the VM space? Whether it does or doesn't, shouldn't the reasoning be similar in how they can be controlled? In regards to LVM overcommit -- does it matter (at least in the short term), if that over-committed space is filled with "SPARSE" data files? I mean, suppose I allocate space for astronomical bodies -- in some areas/directions, I might have very SPARSE usage, vs. towards the core of a galaxy, I might expect less sparse usage. If a file system can be successfully closed with 'no errors' -- doesn't that still mean it is "integrous" -- even if its sparse files don't all have enough room to be expanded? Does it make sense to think about an OOTS (OutOfThinSpace) daemon that can be set up with priorities to reclaim space? I see 2 types of "quota" here. 
And I can see the metaphor of these types being extended into disk space: Direct space, that physically present, and "indirect or *temporary* space" -- which you might try to reserve at the beginning of a job. Your job could be configured to wait until the indirect space is available, or die immediately. But conceivably indirect space is space on a robot-cartridge retrieval system that has a huge amount of virtual space, but at the cost of needing to be loaded before your job can run. Extending that idea -- the indirect space could be configured as "high priority space" -- meaning once it is allocated, it stays allocated *until* the job completes (in other words the job would have a low chance of being "evicted" by an OOTS daemon), vs. most "extended" space would have the priority of "temporary space" -- with processes using large amounts of such indirect space and having a low expectation of quick completion being high on the OOTS daemon's list? Processes could also be willing to "give up memory and suspend" -- where, when called, a handler could give back giga- or terabytes of memory and save its state as needing to restart the last pass. Lots of possibilities -- if LVM-thin space is managed like memory-virtual space. That means some outfits might choose to never over-allocate, while others might allow a fraction. From how it sounds -- when you run out of thin space, what happens now is that the OS keeps allocating more virtual space that has no backing store (in memory or on disk)... with a notification buried in a system log somewhere. On my own machine, I've seen >50% of memory returned after sending a '3' to /proc/sys/vm/drop_caches -- maybe similar emergency measures could help in the short term, with long term handling being as similarly flexible as VM policies. Does any of this sound sensible or desirable? How much effort is needed for how much 'bang'?
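For reference, the /proc/sys/vm overcommit knobs Linda is alluding to are real sysctls; the modes below are their stock kernel meanings, and the values shown are illustrative, not recommendations:

```
# vm.overcommit_memory:
#   0 = heuristic overcommit (the default; obvious overcommits refused)
#   1 = always overcommit, never refuse an allocation
#   2 = strict accounting: commit limit = swap + overcommit_ratio% of RAM
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

# Linda's emergency cache-dropping example (frees page cache,
# dentries and inodes):
echo 3 > /proc/sys/vm/drop_caches
```

Mode 2 is the memory-side analogue of "never over-allocate" thin pools; modes 0 and 1 are the analogue of overcommitting and relying on monitoring.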
* Re: [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) 2016-05-10 21:47 ` [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) Linda A. Walsh @ 2016-05-10 23:58 ` Xen 0 siblings, 0 replies; 16+ messages in thread From: Xen @ 2016-05-10 23:58 UTC (permalink / raw) To: LVM general discussion and development

Hey sweet Linda, this is beyond me at the moment. You go very far with this.

Linda A. Walsh schreef op 10-05-2016 23:47:
> Isn't using a thin memory pool for disk space similar to using
> a virtual memory/swap space that is smaller than the combined sizes of
> all
> processes?

I think there is a point to that, but for me the concordance is in the idea that filesystems should perhaps have different modes of requesting memory (space), as you detail below. Virtual memory typically cannot be expanded automatically, although you could expand it by hand. Even with virtual memory there is normally a hard limit, and unless you include shared memory, there is not really any relation with overprovisioned space -- unless you started talking about prior allotment, and promises being given to processes (programs) that a certain amount of (disk) space is going to be available when it is needed.

So what you are talking about here, I think, is expectation and reservation. A process or application claims a certain amount of space in advance. The system agrees to it. Maybe the total amount of claimed space is greater than what is available. Now processes (through the filesystem) are notified whether the space they have reserved is actually going to be there, or whether they need to wait for that "robot cartridge retrieval system", and whether they want to wait or will quit. They knew they needed space and they reserved it in advance. The system had a way of knowing whether the promises could be met and the requests could be met.
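The reservation scheme described here -- claims made in advance against a pool that may be oversubscribed, with the system able to tell each claimant whether its promise can currently be met -- can be sketched as a simple ledger. All the names below are my own illustration; nothing like this exists in LVM:

```python
class ReservationLedger:
    """Toy model: jobs reserve space in advance; the ledger can
    report whether all promises together still fit in real space."""

    def __init__(self, real_space):
        self.real = real_space
        self.claims = {}              # job name -> bytes promised

    def reserve(self, job, size):
        # Over-subscription is allowed; it is recorded, not refused.
        self.claims[job] = self.claims.get(job, 0) + size

    def committed(self):
        return sum(self.claims.values())

    def all_promises_hold(self):
        """True if every claimant could call in its promise at once."""
        return self.committed() <= self.real

    def shortfall(self):
        return max(0, self.committed() - self.real)

ledger = ReservationLedger(real_space=100)
ledger.reserve("job-a", 60)
ledger.reserve("job-b", 60)
print(ledger.shortfall())        # 20: promises exceed real space
print(ledger.all_promises_hold())  # False
```

A job scheduler could consult `all_promises_hold()` before admitting new work -- exactly the "wait or die immediately" choice Linda describes.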
So the concept that keeps recurring here seems to be reservation of space in advance. That seems to be the holy grail now. Now I don't know, but I assume you could develop a good model for this like you are trying here.

Sparse files are difficult for me; I have never used them. I assume they could be considered sparse by nature and not likely to fill up. Filling up is of the same nature as expanding. The space they require is virtual space; their real space is the condensed space they actually take up. It is a different concept. You really need two measures for reporting on these files: real and virtual.

So your filesystem might have 20G real space. Your sparse file is the only file. It uses 10G actual space. Its virtual file size is 2T. Free space is reported as 10G. Used space is given two measures: actual used space, and virtual used space.

The question is how you store these. I think you should store them condensed. As such, only the condensed blocks are given to the underlying block layer / LVM. I doubt you would want to create a virtual space from LVM such that your sparse files can use a huge filesystem in a non-condensed state sitting on that virtual space? But you can? Then the filesystem doesn't need to maintain blocklists or whatever, but keep in mind that normally a filesystem will take up a lot of space in inode structures and the like when the filesystem is huge but the actual volume is not. If you create one thin pool, and a bunch of filesystems (thin volumes) of the same size, with default parameters, your entire thin pool will quickly fill up with just metadata structures.

I don't know. I feel that sparse files are weird anyway, but if you use them, you'd want them to be condensed in the first place, existing in a sort of mapped state where virtual blocks are mapped to actual blocks. That doesn't need to be LVM and would feel odd there. That's not its purpose, right?
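The two measures described here (virtual size vs. real, condensed size) already exist for sparse files on most Unix filesystems: `stat` reports both `st_size` (the virtual length, what `ls -l` shows) and `st_blocks` (allocated 512-byte units, roughly what `du` shows). A small demonstration, assuming a filesystem that leaves holes for unwritten ranges (ext4, xfs, tmpfs all do):

```python
import os
import tempfile

# Create a file whose virtual size far exceeds its allocated space:
# seek 1 GiB past the start and write a single byte, leaving a hole.
fd, path = tempfile.mkstemp()
os.lseek(fd, 1024 ** 3, os.SEEK_SET)
os.write(fd, b"x")
os.close(fd)

st = os.stat(path)
virtual = st.st_size        # 1 GiB + 1 byte: the "promise"
real = st.st_blocks * 512   # only the block(s) actually backed
print(virtual, real)
os.unlink(path)
```

On a sparse-capable filesystem `real` is a few kilobytes at most, which is exactly the "condensed" representation argued for above.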
So for sparse you need a mapping at some point, but I wouldn't abuse LVM for that primarily. I would say that is 80% filesystem and 20% LVM, or maybe even 60% custom system, 20% filesystem and 20% LVM. Many games pack their own filesystems, like we talked about earlier (when you discussed the inefficiency of many small files in relation to 4k block sizes).

If I really wanted sparse personally, as an application data storage model, I would first develop this model myself. I would probably want to map it myself. Maybe I'd want a custom filesystem for that. Maybe a loopback-mounted custom filesystem, provided that its actual block file could grow. I would imagine allocating containers for it, and I would want the "real" filesystem to expand my containers or to create new instances of them. So instead of mapping my sectors directly, I would want to map them myself first, in a tiered system, and have the filesystem map the higher hierarchy level for me. E.g. I might have containers of 10G each allocated in advance, and when I need more, the filesystem allocates another one. So I map the virtual sectors to another virtual space, such that for my containers:

container virtual space / container size = outer container addressing
container virtual space % container size = inner container addressing

Outer container addressing goes to a filesystem structure telling me (or it) where to write my data to. Inner container addressing follows normal procedure, and writes "within a file". So you would have an overflow where the most significant bits cause a container change. At that point I've already mapped my "real" sparse space to container space; it's just that the filesystem allows me to address it without missing a beat. What's the difference with a regular file that grows? You can attribute even more significant bits to filesystem change as well. You can have as many tiers as you want. You would get "falls outside of my jurisdiction" behaviour, "passing it on to someone else". LVM thin?
Hardly relates to it. You could have addressing bits that reach to another planet ;-) :).

> If a file system can be successfully closed with 'no errors' --
> doesn't that still mean it is "integrous" -- even if its sparse files
> don't all have enough room to be expanded?

Well, that makes sense. But that's the same as saying that a thin pool is still "integrous" even though it is over-allocated. You are saying the same thing here, almost. You are basically saying: v-space > r-space == ok? Which is the basic premise of overprovisioning to begin with, with the added distinction of "assumed possible intent to go and fill up that space". Which comes down to: "I have a total real space of 2GB, but my filesystem is already 8GB. It's a bit deceitful, but I expect to be able to add more real space when required."

There are two distinct cases:

- total allotment > real space, but individual allotments < real space
- total allotment > real space, AND individual allotments > real space

I consider the first acceptable. The second is spending money you don't have. I would never consider creating an individual filesystem (volume) that is actually bigger (ON ITS OWN) than all the space that exists. I think it is like living on debt. You borrow money to buy a house. It is that system. You borrow future time. You get something today, but you will have to work for it for a long time, paying for something you bought years ago. So how do we deal with future time? That is the question. Is it acceptable to borrow money from the future? Is it acceptable to use space now that you will only have tomorrow?

> If a file system can be successfully closed with 'no errors' --
> doesn't that still mean it is "integrous" -- even if its sparse files
> don't all have enough room to be expanded?

If your sparse file has no intent to become non-sparse, then it is no issue. If your sparse file already tells you it is going to get you in trouble, it is different.
This system is integrous depending on planned actions. The same is true for LVM now. The system is safe until some program decides to allocate the entire filesystem. And there are no checks and balances; the system will just crash.

The peculiar condition is that you have built a floor, like a circular area of a certain surface area. But 1/3 of the floor is not actually there. You keep telling yourself not to go there. The entire circle appears to be there, but you know some parts are missing. That is the current nature of LVM thin. You know that if you step on certain zones, you will fall through and crash to the ground below. (I have had that happen as a kid. We were in the attic and we had covered the ladder gap with cardboard. Then we -- or at least I -- forgot that the floor was not actually real, and I walked on it, instantly falling through and ending on a step of the ladder below.)

[ People here keep saying that a real admin would not walk on that ladder gap. A real admin would know where the gap was at all times. He would not step on it, and not fall through. But I've had it happen that I forgot where the gap was and I stepped on it anyway. ]

> Does it make sense to think about a OOTS (OutOfThinSpace) daemon
> that
> can be setup with priorities to reclaim space?

It does make some sense, certainly, to me at least -- no matter if I understand little or am of no real importance here -- but I don't really understand the implications at this point.

> Processes could also be willing to "give up memory and suspend" --
> where, when called, a handler could give back Giga-or Tera bytes of
> memory
> and save it's state as needing to restart the last pass.

That is almost a calamity mode. I need to shower, but I was actually just painting the walls. I need to stop painting that shower, so I can use it for something else.
I think it makes sense to lay a claim to some uncovered land, but when someone else also claims it, you discuss who needs it most, whether you feel like letting the other one have it, whose turn it is now, and whether it will hurt you to let go of it. It is almost the same as reserving classrooms. So like I said: reservation. And like you say, only temporary space that you need for jobs. In a normal user system that is not computationally heavy, these things do not really arise, except maybe for video editing and the like. If you have large data jobs like you are talking about, I think you would need a different kind of scheduling system anyway. But not so much an automatic one.

Running out of space is not a serious issue if the administrator system allots space to jobs. It doesn't have to be a filesystem doing that. But I guess your proposed daemon is just a layer above that, knowing about space constraints, and then allotting space to jobs based on priority queues. Again, that doesn't really have much to do with thin, unless every "job" would have its own thin volume, and the thin pool-volume system would get used to allot space (the V-size of the volume); but if too much space was allotted, the system would get in trouble (overprovisioning) if all jobs ran. Again, borrowing money from the future.

The premise of LVM thin is not that every volume is going to be able to use all its space. It's not that it should, has to, or is going to fill up as a matter of course, as an expected and normal thing. You see, thin LVM only works if the volumes are independent. In that job system they are not independent. Independence entails an expected growth that does not happen in concert. It involves a probability distribution in which the average expected space usage is less than the maximum. LVM thin is really a statistical thing, basing itself on the law of large numbers, averaging, and the expectation that if ONE volume is going to be max, another one won't.
If you are going to allot jobs that are expected to completely fill up the reserved space, you are talking about an entirely different thing. You should provision based on the average, but if the average is the max, it makes no sense anymore and you should just apportion according to available real space. You do not need thin volumes or a thin pool to do that sort of thing: just regular fixed-size filesystems with jobs and space requests.

In other words, the amount of sane overprovisioning you can do is related to the difference between max and average. The difference (max - average) is the amount you can safely overprovision given normal circumstances. You do not "on purpose" and willfully provision less than the average you expect. Average is your criterion. Max is the individual max size. Overprovisioning is the ability of an individual volume to grow beyond average towards max. If the calculations hold, some other volume will be below average. However, if your numbers are smaller (not 1000s of volumes, but just a few), the variance grows enormously. And with the growth in variance you can no longer predict what is going to happen.

But the real question is whether there is going to be any covariance, and in a real thin system, there should be none (independence). For instance, if there is some hype and all your clients suddenly start downloading the next best movie from 200G television, you already have covariance. Social unrest always indicates covariance. People stop making their own choices, and your predictions and business as usual no longer hold true. Not because your values weren't sane -- more because people don't act naturally in those circumstances. Covariance indicates that there is a tertiary factor causing (for instance) growth in (volumes) across the line. John buys a car, and Mary buys a house, but actually it is because they are getting married.
Or, John buys a car, and Mary buys a house, but the common element is that they have both been brainwashed by contemporary economists working at the World Bank. All in all, the insanity happens when you start to borrow from the future, which causes you to have to work your ass off to meet the demands you placed on yourself earlier, always having to rush, panic, and be under pressure. Better not to overprovision beyond your average, in the sense of not even having enough for what you expect to happen.

> From how it sounds -- when you run out of thin space, what happens
> now is that the OS keeps allocating more Virtual space that has no
> backing store (in memory or on disk)...with a notification buried in a
> system log
> somewhere.

Sounds like the gold standard: having money with no gold behind it, or anything else of value. ^ permalink raw reply [flat|nested] 16+ messages in thread
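Xen's provisioning rule in the message above -- size the pool for n × average, and treat (max − average) per volume as the only sane overprovisioning headroom -- can be put in numbers. A toy model, using the 35GB-average / 50GB-max figures quoted elsewhere in this thread (this is arithmetic on the thread's own example, not anything LVM computes):

```python
def safe_pool_size(n_volumes, avg_usage):
    """Real space needed so the pool covers expected total usage."""
    return n_volumes * avg_usage

def overprovision_headroom(max_size, avg_usage):
    """Per-volume growth beyond average that the pool can absorb
    only because some other volume is expected to sit below average."""
    return max_size - avg_usage

n, vmax, avg = 1000, 50, 35          # 1000 clients, 50G volumes, 35G average
pool = safe_pool_size(n, avg)
print(pool)                           # 35000 (GB of real space)
print(overprovision_headroom(vmax, avg))  # 15
print(n * vmax - pool)                # 15000: total promised but unbacked
```

With only a few volumes instead of 1000, the same averages hold but the variance of total usage is far larger -- which is the letter's point about small numbers breaking the statistics.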
* Re: [linux-lvm] about the lying nature of thin [not found] <1714078834.3820492.1461944731537.JavaMail.yahoo.ref@mail.yahoo.com> @ 2016-04-29 15:45 ` matthew patton 2016-05-02 13:18 ` Mark H. Wood 0 siblings, 1 reply; 16+ messages in thread From: matthew patton @ 2016-04-29 15:45 UTC (permalink / raw) To: LVM general discussion and development

> ~35GB each, meaning 35000 GB is available and 25000 is
> in use, then it is not a lie to say to any individual customer: you can
> use 50GB if you want.

If enough of your so-called customers decide to use the space you promised them AND THAT THEY PAID FOR, and instead they get massive data loss and outages, you can bet your hiney they'll sue you silly. If you want to play fast and loose in your basement, that's one thing -- thin away. If you try to pull a similar stunt in a commercial setting, you either do your homework and put all necessary safeguards in place to prevent customer demand from overwhelming your cheap-sh*t corner cutting, or you'd better have an attorney on retainer and a budget for breach-of-contract settlements.

> hold, but that is never communicated.

Then you, sir, will no doubt find yourself in front of a magistrate for no less than false representation, if it is not explained in the terms of service that the storage capacity you SOLD doesn't really exist, and that if they (or anyone they are unlucky enough to be co-located with) just so happen to write too fast to their storage, they may well lose their data.

> As a customer you are not aware of how many other clients there are, or
> how many other thin volumes (ordinarily) or what the max capacity is across all the
> volumes. So you are not being lied to.

I strongly suggest you go take a class on contract law (since OS basics is apparently beyond your grasp) and familiarize yourself with your country's prison conditions. At the very least, go talk to an attorney and pay him a consultation fee.
As to the rest of your message, perhaps you'd get more insight and traction by starting your own blog to wax philosophical over cheating paying customers, engaging in data-losing computing practices, and being too cheap, lazy, or opinionated to run a responsible service -- and for trotting out example after example of non-computer social conditions as if they had any relevance to the matter at hand. Now if you have something useful to say/ask about LVM, please continue. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] about the lying nature of thin 2016-04-29 15:45 ` [linux-lvm] about the lying nature of thin matthew patton @ 2016-05-02 13:18 ` Mark H. Wood 2016-05-03 11:57 ` Xen 0 siblings, 1 reply; 16+ messages in thread From: Mark H. Wood @ 2016-05-02 13:18 UTC (permalink / raw) To: linux-lvm

On Fri, Apr 29, 2016 at 03:45:31PM +0000, matthew patton wrote:
> > ~35GB each, meaning 35000 GB is available and 25000 is
> > in use, then it is not a lie to say to any individual customer: you can
> > use 50GB if you want.
>
> If enough of your so-called customers decide to use the space you promised them AND THAT THEY PAID FOR and instead they get massive data loss and outages, you can bet your hiney they'll sue you silly.

Executive summary: you shouldn't just take a wild guess and then turn your back on a thin-provisioned setup; you must understand your consumers and monitor your resources.

It's reasonable in certain circumstances for a service provider to over-subscribe his hardware. He would be well advised to monitor actual allocation closely, to keep some cash or ready credit on hand for quick expansion of his real hardware, and to respond promptly by adding capacity when usage nears real hardware limits. He is taking a risk, betting that most customers won't max out their promised storage, and should manage that risk. Indeed, he should first gather statistics to understand the behavior of typical customers and determine whether he would be taking a *foolish* risk.

Failure to adequately manage resources to redeem contracted promises is the provider's lie, not LVM's. Failure to plan is planning to fail. If that's too scary, don't use thin provisioning.

-- Mark H. Wood Lead Technology Analyst University Library Indiana University - Purdue University Indianapolis 755 W.
Michigan Street Indianapolis, IN 46202 317-274-0749 www.ulib.iupui.edu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] about the lying nature of thin 2016-05-02 13:18 ` Mark H. Wood @ 2016-05-03 11:57 ` Xen 0 siblings, 0 replies; 16+ messages in thread From: Xen @ 2016-05-03 11:57 UTC (permalink / raw) To: linux-lvm; +Cc: Mark H. Wood

Mark H. Wood schreef op 02-05-2016 15:18:
> Failure to adequately manage resources to redeem contracted promises
> is the provider's lie, not LVM's. Failure to plan is planning to
> fail.

Exactly. And it starts being a lie when resources don't outlast use, and in some way the provider doesn't own up to that but lets it happen. That is irrespective, however, of the thought that choosing or not choosing to communicate any part of that when it does happen or would happen is a choice you can make, and it doesn't take away from thin provisioning at all.

If you feel you can always meet your expectations and those of your clients, and work hard to achieve that, you may never run into the situation. However, if you do run into the situation, the choice becomes how to deal with it. You can also make a proactive choice in advance to either be open about it then, or to stick your head in the sand, as the proverb goes. I bet many contingency plans used in business everywhere have choices surrounding this made in advance. When do we alert the public? When do we open up? When does it go so far that we cannot hide it anymore?

In Dutch we call this "keeping in the dirty laundry" -- you only take the clean laundry out to dry (on a line). It is quite customary and usual for a human being not to want to give insight into private matters that might only confuse the other person. At the same time there is also the question of when to own up to stuff that is actually important to another person, and I think this is a question of ethics. Sometimes people are not harmed by not knowing things, but you would be harmed by them knowing. Sometimes people are harmed by not knowing things, and you are not harmed by them knowing.
I think that if we are talking about a business setting where you have promised a certain thing to people who are now depending on it, then the matter shifts in the direction of the second statement. If you have a contractual responsibility to deliver, you also have a contractual responsibility to inform. That is my opinion on the subject, at least. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] about the lying nature of thin [not found] <1093625508.5537728.1462283037119.JavaMail.yahoo.ref@mail.yahoo.com> @ 2016-05-03 13:43 ` matthew patton 2016-05-03 17:42 ` Xen 0 siblings, 1 reply; 16+ messages in thread From: matthew patton @ 2016-05-03 13:43 UTC (permalink / raw) To: LVM general discussion and development

Xen wrote:
> I didn't know thin (or LVM) doesn't maintain maps of used blocks.

Right, so you're ignorant of basics like how the various subsystems work. Like I said, go find a text on OS and filesystem design. Hell, read the EXT and LVM code or even just the design docs.

> The recent DISCARD improvements apparently just signal some special case
> (?) but SSDs DO maintain maps or it wouldn't even work (?).

Again, read up on the inner workings of SSDs. To over-simplify, SSDs have their own "LVM". No different really than a hardware RAID controller -- admittedly most RAID controllers don't do anything particularly advanced.

> I don't know, it would seem that having a map of used extents in a thin
> pool is in some way deeply important in being able to allocate unused
> ones?

Clearly you are in need of much more studying. LVM knows exactly, out of all of its defined extents, which ones are free and which ones have been assigned to an LV -- aka written to. Which individual blocks (aka ranges of bytes) inside those extents have FS-managed data in them, it knows not, nor does it care.

> I guess continuous polling would be deeply disrespectful of the hardware
> and software resources.

Not to mention instantaneously invalid. So you poll LVM: "what is your allocation map and do you have any free extents?" You get the results. Then the FS, having been assured there is free space, issues writes. But oh no, in the round-trip some other LV has grabbed the extent you had intended to use! IO=FAIL.
The ONLY way for an FS to "reserve" a set of blocks (aka an extent) to itself is to write to it -- but mind, the FS has NO IDEA whether it needs to do a reservation in the first place, nor whether this IO just so happens to fit inside the allocated range while the next IO at offset +1 will require a new extent to be allocated from the THINP.

I haven't checked, but it's perfectly possible for LVM THINP to respond to FS-issued DISCARD notices and thus build an allocation map of an extent, and, should an extent be fully empty, return the extent to the thin pool -- only to have to allocate a new extent if any IO hits the same block range in the future. This kind of extent churn is probably not very useful unless your workload is in the habit of writing tons of data, freeing it, waiting a reasonable amount of time, and potentially doing it again. SSDs resort to it because they must -- it's the nature of the silicon device itself.

> It would say to a filesystem: these regions are currently unavailable.
>
> You would even get more flags:
>
> - this region is entirely unavailable
> - this region is now more expensive to allocate to
> - this region is the preferred place

All of this "inside knowledge" and "coordination" you so desperately seem to want is called integration. And again, it is spelled BTRFS and ZFS, et al.

> In the theoretical system I proposed it would be a constant

Yeah, have fun with that theoretical system. ... Xen, dude, seriously. Go do a LOT more reading. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [linux-lvm] about the lying nature of thin 2016-05-03 13:43 ` matthew patton @ 2016-05-03 17:42 ` Xen 0 siblings, 0 replies; 16+ messages in thread From: Xen @ 2016-05-03 17:42 UTC (permalink / raw) To: LVM general discussion and development

matthew patton schreef op 03-05-2016 15:43:
> Xen wrote:
>
>> I didn't know thin (or LVM) doesn't maintain maps of used blocks.
>
> Right, so you're ignorant of basics like how the various subsystems
> work. Like I said, go find a text on OS and filesystem design. Hell,
> read the EXT and LVM code or even just the design docs.

Why don't you do it for me and then report back? I could use a slave like the one you are trying to make of me.

>> The recent DISCARD improvements apparently just signal some special
>> case
>> (?) but SSDs DO maintain maps or it wouldn't even work (?).
>
> Again, read up on the inner workings of SSDs. To over-simplify, SSDs
> have their own "LVM". No different really than a hardware RAID
> controller does - admittedly most raid controllers don't do anything
> particularly advanced.

It almost seems like you want me to succeed.

> clearly you are in need of much more studying. LVM knows exactly out
> of all of it's defined extents which ones are free and which ones have
> been assigned to an LV - aka written to. What individual blocks (aka
> range of bytes) inside those extents have FS-managed data in them it
> knows not nor does it care.

Then what is the issue here? That means my assumptions were all entirely correct, and what Zdenek has said must have been false. But what you are describing now is extent assignment to LVs; do you imply this is also true of assignment to thin volumes? Yes, when you say "written to" you clearly mean thin pools. I never claimed that it needed to know or care about the actual usage of its blocks (extents). If a filesystem DISCARDs blocks, then with enough blocks it could discard an extent.
I don't even know what will happen if a filesystem stops using the data that's on it, but I will test that now. And of course it should just free those blocks. It didn't work with mkswap just now, but creating a new filesystem causes lvs to report a lower thin pool usage. Of course: common and commonsensical. So these extents are being liberated, right? And it knows exactly how many are in use? Then what was this about:

> Thin pool is not constructing 'free-maps' for each LV all the time -
> that's why tools like 'thin_ls' are meant to be used from the
> user-space.
> It IS very EXPENSIVE operation.

It is saying that e.g. lvs creates this free-map. But LVM needs to know at every moment in time what extents are available. It also needs to liberate them at runtime. So it needs to be able at least to search for free ones and, if none is found, to report that or do something with it. Of course that is different from having a map. But in-the-moment update operations to filesystems would not require a map. They would require mutations being communicated -- mutations that LVM already knows about. So it is nothing special. You don't need those "maps". You need to communicate (to other thin volumes) which extents have become unavailable, and which have become available once more. Then the thin volume translates this (possibly) to whatever block system the underlying filesystem uses: logical blocks, physical blocks. The main organisation principle is the extent. It is not LVM that needs to maintain a map; it is the filesystem. It needs to know about its potential for further allocation of the block space.

>> I guess continuous polling would be deeply disrespectful of the
>> hardware
>> and software resources.
>
> Not to mention instantaneously invalid. So you poll LVM, "what is your
> allocation map and do you have any free extents?" You get the results.
> Then the FS having been assured there is free space issues writes.
> But
> oh no, in the round-trip some other LV has grabbed the extent you had
> intended to use! IO=FAIL.

You know, those contention issues are everywhere, in the kernel also, and they are always taken care of. Don't confront me with a situation that has already been solved by numerous other people. You forget, for once, that real software systems running on the filesystem would be aware of the lack of space to begin with. You are now approaching a corner case where the last free extent is being contended for. I am sure there would be an elegant solution to that. This corner case is not what it's all about. What it's about is that the filesystem has the means to predict what is going to happen, or at least the software running on it has.

If the situation you are describing is really an issue, you could simply reserve a last block (extent) for this scenario that is only written to if all other blocks are taken, and each filesystem (volume) has this free block of its own. PROBLEM SOLVED. You sound like Einstein when he tried to disprove Bohr's theory at that convention. In the end Bohr refuted everything, and Einstein had to accept that Bohr was right.

A filesystem will simply reserve the equivalent of an extent. More importantly, the thin volume (logical volume) will. The thin LV will reserve one last extent in advance from the thin pool that is only really given to the filesystem under the condition that the entire thin pool is already taken and the filesystem is still issuing a write to a new block because of a race condition that prevented it from knowing about the space issue. These are not difficult engineering problems.

> The ONLY way for a FS to "reserve" a set of blocks (aka extent) to
> itself is to write to it - but mind the FS has NO IDEA if needs to do
> an reservation in the first place nor if this IO just so happens to
> fit inside the allocated range but the next IO at offset +1 will
> require a new extent to be allocated from the THINP.
If you write to a full extent, you are guaranteed to get a new one. It's not more difficult than that. Don't make everything so difficult. I have not talked about reservations myself (prior to this). As we just said, if it is only about the very last block of the entire thin pool, reserve it in advance and don't let the FS do it. If the race condition is such that larger amounts are needed for safety, do that: reserve 200MB in advance if you need it. You could configure a thin pool / volume to reserve a certain amount of free space that is only going to be used if the thin pool is 100% filled and it wasn't possible to inform the filesystems fast enough -- proportional to the size of the volume (LV). Who cares if you reserve 1% in each volume for this, or less. A 2TB volume with 1GB of reserved space is not so bad, is it? That's just 0.05%, give or take.

If free space is then reported to the filesystem, it can:

1) simply inform programs by way of its normal operation
2) stop writing when the space known to it is gone
3) not have to worry about anything else, because race conditions are taken care of.

In the event that a filesystem starts randomly writing a single byte to every possible block in order to defeat this system, the filesystem can redirect these writes to other blocks when LVM starts reporting "no block for you" while the filesystem still has space in the blocks it has. It will just have to invalidate some of its own blocks (extents). IT needs to maintain a map, not LVM. It can deduce its own free space from its own map. It would be like allocating a thin (sparse) file but then writing to every possible address along its range. Yes, the system is going to misbehave, but you can take care of it. Some writes will just fail when out of blocks, but the filesystem can redirect them, or in the end just fail writing / allocating. Any block being invalidated would instantly update its free space calculations.
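The "reserve one last extent per volume" idea argued above can be modelled as a pool allocator that holds back one emergency extent for each volume up front, handing it out only when the shared pool is exhausted. This is purely a sketch of the proposal, not existing dm-thin behaviour:

```python
class ThinPool:
    """Toy allocator: each volume gets one held-back emergency
    extent, so losing the race for the pool's last shared extent
    never fails a single in-flight write."""

    def __init__(self, extents, volumes):
        self.reserved = {v: 1 for v in volumes}   # one per volume
        self.free = extents - len(volumes)        # shared remainder

    def allocate(self, volume):
        if self.free > 0:                  # normal path
            self.free -= 1
            return "shared"
        if self.reserved.get(volume, 0) > 0:  # emergency path
            self.reserved[volume] -= 1
            return "emergency"
        return None                        # genuinely out of space

pool = ThinPool(extents=4, volumes=["lv0", "lv1"])
print([pool.allocate("lv0") for _ in range(3)])
# ['shared', 'shared', 'emergency']
print(pool.allocate("lv1"))   # 'emergency': lv1 kept its own reserve
print(pool.allocate("lv1"))   # None: truly exhausted
```

The point of the sketch: a volume hits "emergency" at most once before it must have been told the pool is full, which is the window the race-condition argument is about.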
You don't need to communicate full maps unless you were creating a new filesystem or trying to recover from corruption. You would query "is this block available?", for instance. That would require a new command. It would take a while, but that way the filesystem could reconstruct the block map. Or it could query about ranges of blocks; this querying is the first thing you'd introduce. Blocks N to M, are they available? Yes or no. Or a list of the ones that are and the ones that aren't (a bitmap). To query 2000 extents you only need 2000 bits. That's 250 bytes, not a whole lot. A 2 TB volume would have a free map of 64 KB. Do you realize how small that is?

How would maintaining free maps be an expensive operation, really? You need a 64 KB field and an XOR operation. That fits inside a 16-bit 8086 segment. I mean, don't bullshit me here: there is no way it could be hard to maintain free maps. I'm a programmer too, you know, and I have been doing it since 1989 too. I have programmed in Pascal and assembler and I have studied Java's BitSet class, for instance. It can be done very elegantly.

Any free map the thin LV would conjure up would be a lie in that sense, a choice, because you would arbitrarily invalidate blocks at the end of the virtual space. The pool communicates to the volume the acquisition and release of new and old extents. The volume at that point doesn't care which they are; it only needs to know the number. With every mutation it randomly invalidates a single block if it needs to (or enables one again). It sets a bit flag in a 64 KB field.

So let's assume we have a 1 PB volume. A petabyte. That's 2^50 / 2^22 = 2^28 extents, i.e. 2^28 bits = 2^25 bytes = 32 MB worth of data. For a volume of 1125899906842624 bytes, just 33554432 bytes are needed to maintain a map, if done in 4 MB extents.
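The free map described here is small enough to sketch concretely. A minimal version in Python (names and interface are my own; this is an illustration of the data structure, not any LVM API): one bit per 4 MiB extent in a plain byte array, plus the range query suggested above.

```python
# Sketch of the proposed free map: one bit per 4 MiB extent, bit set
# meaning "extent currently unavailable".  A range query is just a
# slice of booleans derived from the bit field.

EXTENT = 4 * 2**20   # 4 MiB extent size, as assumed in the text

class FreeMap:
    def __init__(self, volume_bytes):
        self.nbits = volume_bytes // EXTENT
        self.bits = bytearray((self.nbits + 7) // 8)   # 1 bit per extent

    def set_unavailable(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def set_available(self, i):
        self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def is_available(self, i):
        return not (self.bits[i // 8] >> (i % 8)) & 1

    def query(self, lo, hi):
        """Answer 'extents lo..hi-1: which are available?' as a list."""
        return [self.is_available(i) for i in range(lo, hi)]

# The sizes claimed in the post:
print(len(FreeMap(2 * 2**40).bits))   # 2 TiB volume -> 65536 bytes (64 KiB)
print(len(FreeMap(2**50).bits))       # 1 PiB volume -> 33554432 bytes (32 MiB)
```

The two prints confirm the arithmetic in the text: 64 KiB of map for a 2 TiB volume, 32 MiB for a petabyte.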
If done in 4 KB blocks, the extent communication stays the same, but it could amount to 1024x that number of bytes: 32 GB for a 1 PB volume. That is still only 1/32768 of its address space, so to speak. But the filesystem could maintain maps of extents rather than individual 'blocks'. Maybe 32 GB is hard to communicate, but 32 MB is not, and there are systems with a terabyte of RAM.

> I haven't checked, but it's perfectly possible for LVM THINP to
> respond to FS issued DISCARD notices and thus build an allocation map
> of an extent. And should an extent be fully empty to return the extent
> to the thin pool.

I don't know how it is done currently, but clearly the system knows, right? As you say, this is perfectly possible.

> Only to have to allocate a new extent if any IO hits
> the same block range in the future. This kind of extent churn is
> probably not very useful unless your workload is in the habit of
> writing tons of data, freeing it and waiting a reasonable amount of
> time and potentially doing it again. SSDs resort to it because they
> must - it's the nature of the silicon device itself.

Unused blocks need to be made available anyway. A filesystem on which 80% of the data has been deleted, still holding all those blocks in the thin pool? Please tell me this isn't reality (I know it isn't).

So I ran this test, just curious what would happen:

1. Create a thin pool of 400 MB on another hard disk.
2. Create 3 thin volumes totalling 600 MB.
3. Create filesystems (ext3) on them and mount them.
4. Copy 90 MB files to them. After 4 files, 360 MB of the pool is used.
5. Copy a 5th file. Nothing happens. No errors, nothing.
6. Copy a 6th file. Nothing happens. No errors, nothing.
7. I check the volumes. Nothing seems the matter. lvdisplay shows nothing unusual. df works and everything appears normal. All volumes are now 97% filled and the pool is 100% filled. That can't last, right? I see kernel block device page errors come by.
I go to one of the files that should have been written successfully (the 4th file) and try to copy it to my main disk. cp hangs. Terminal (tty) switching still works. Vim (I had vim open in 2 or 3 ttys) stops responding. Alt-7 (should open KDE): nothing happens. Then I cannot switch TTYs anymore. The system hangs completely.

Mind you, this was on a hard disk with no volumes in use; no volumes were mounted other than those 3, although of course they were loaded in LVM. There are no dropped volumes. There are no frozen volumes. The system just crashes. Very graceful, I must say. I mean, if this is the best you can do? No wonder you suggest every admin needs to hire a drill instructor to get him through the day.

>> It would say to a filesystem: these regions are currently unavailable.
>>
>> You would even get more flags:
>>
>> - this region is entirely unavailable
>> - this region is now more expensive to allocate to
>> - this region is the preferred place
>
> All of this "inside knowledge" and "coordination" you so desperately
> seem to want is called integration. And again spelled BTRFS and ZFS.
> et. al.

BTRFS is spelled "monopoly" and "wants to be all" and "I'm friends with systemd" ;-). ZFS I don't know; I haven't cared about it. All I see on IRC is people talking about it like some new toy they desperately can't live without, even though it doesn't serve them any real purpose. A bit like a toy drone worth 4k dollars.

The only thing that changes is that filesystems maintain bitmaps of available sectors/blocks or extents, and are capable of intelligently allocating within the ones they have that are available. That's it! You can still choose which filesystem to use. You could even choose which volume manager to use. We have seen how little data it costs if the extent size is at least 4 MB. We have seen how easy it would be to query the underlying layer again in case you're not sure. And if you want a block to have more bits, that is easy too.
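The three region flags quoted above, plus "available", fit in two bits per extent. A sketch of such a map (again my own names and layout, not any existing interface):

```python
# Sketch of the richer per-extent map: two bits encode four region
# states -- available, unavailable, more expensive, preferred.

EXTENT = 4 * 2**20   # 4 MiB extents, as elsewhere in the text
AVAILABLE, UNAVAILABLE, EXPENSIVE, PREFERRED = 0, 1, 2, 3

class StateMap:
    def __init__(self, volume_bytes):
        n = volume_bytes // EXTENT
        self.data = bytearray((n * 2 + 7) // 8)   # 2 bits per extent

    def set_state(self, i, state):
        byte, shift = divmod(i * 2, 8)
        self.data[byte] = (self.data[byte] & ~(0b11 << shift) & 0xFF) | (state << shift)

    def get_state(self, i):
        byte, shift = divmod(i * 2, 8)
        return (self.data[byte] >> shift) & 0b11

print(len(StateMap(2 * 2**40).data))   # 2 TiB volume -> 131072 bytes (128 KiB)
```

The print confirms the cost: doubling the per-extent information only doubles the map, from 64 KB to 128 KB for a 2 TB volume.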
If you have only 4 possible states, you can put them in 2 bits. That would probably be enough for almost any use case. A 2 TB volume costs 128 KB for this bitmap with 4 states. That's something you could achieve on a 286 if you were crazy enough.

> yeah, have fun with that theoretical system.

Why won't you?

> Xen, dude seriously. Go do a LOT more reading.

I am being called by name :O! I think she likes me.
end of thread, other threads: [~2016-05-10 23:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-28 22:37 [linux-lvm] about the lying nature of thin Xen
2016-04-29  8:44 ` Marek Podmaka
2016-04-29 10:06 ` Gionatan Danti
2016-04-29 13:16 ` Xen
2016-04-29 22:32 ` Xen
2016-04-30  4:46 ` Mark Mielke
2016-05-03 13:03 ` Xen
2016-04-29 11:53 ` Xen
2016-04-29 20:37 ` Chris Friesen
2016-05-10 21:47 ` [linux-lvm] thin disk -- like overcomitted/virtual memory? (was Re: about the lying nature of thin) Linda A. Walsh
2016-05-10 23:58 ` Xen
[not found] <1714078834.3820492.1461944731537.JavaMail.yahoo.ref@mail.yahoo.com>
2016-04-29 15:45 ` [linux-lvm] about the lying nature of thin matthew patton
2016-05-02 13:18 ` Mark H. Wood
2016-05-03 11:57 ` Xen
[not found] <1093625508.5537728.1462283037119.JavaMail.yahoo.ref@mail.yahoo.com>
2016-05-03 13:43 ` matthew patton
2016-05-03 17:42 ` Xen