On 6/4/19 12:25 PM, Alexander Duyck wrote:
> On Tue, Jun 4, 2019 at 9:08 AM Nitesh Narayan Lal wrote:
>>
>> On 6/4/19 11:14 AM, Alexander Duyck wrote:
>>> On Tue, Jun 4, 2019 at 5:55 AM Nitesh Narayan Lal wrote:
>>>> On 6/3/19 3:04 PM, Alexander Duyck wrote:
>>>>> On Mon, Jun 3, 2019 at 10:04 AM Nitesh Narayan Lal wrote:
>>>>>> This patch introduces the core infrastructure for free page hinting in
>>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>>> can be reported to its hypervisor so that the hypervisor could
>>>>>> free and reuse that memory as per its requirement.
>>>>>>
>>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>>> via MADV_FREE), the guest must not use them, otherwise, data loss
>>>>>> would be possible. To avoid such a situation, these pages are
>>>>>> temporarily removed from the buddy. The amount of pages removed
>>>>>> temporarily from the buddy is governed by the backend (virtio-balloon
>>>>>> in our case).
>>>>>>
>>>>>> To efficiently identify free pages that can be hinted to the
>>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>>
>>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>>> reporting these batched pages to the host synchronously. Once
>>>>>> reporting/freeing is complete, isolated pages are returned back to the
>>>>>> buddy.
>>>>>>
>>>>>> There are still various things to look into (e.g., memory hotplug, more
>>>>>> efficient locking, possible races when disabling).
>>>>>>
>>>>>> Signed-off-by: Nitesh Narayan Lal
>>>>> So one thing I had thought about, that I don't believe has been
>>>>> addressed in your solution, is to determine a means to guarantee
>>>>> forward progress. If you have a noisy thread that is allocating and
>>>>> freeing some block of memory repeatedly you will be stuck processing
>>>>> that and cannot get to the other work. Specifically if you have a zone
>>>>> where somebody is just cycling the number of pages needed to fill your
>>>>> hinting queue how do you get around it and get to the data that is
>>>>> actually code instead of getting stuck processing the noise?
>>>> It should not matter. As every time the memory threshold is met, entire
>>>> bitmap is scanned and not just a chunk of memory for possible isolation.
>>>> This will guarantee forward progress.
>>> So I think there may still be some issues. I see how you go from the
>>> start to the end, but how do you loop back to the start again as pages
>>> are added? The init_hinting_wq doesn't seem to have a way to get back
>>> to the start again if there is still work to do after you have
>>> completed your pass without queue_work_on firing off another thread.
>>>
>> That will be taken care of as part of a new job, which will be
>> en-queued as soon as the free memory count for the respective zone
>> reaches the threshold.
> So does that mean that you have multiple threads all calling
> queue_work_on until you get below the threshold?

Every time a page of order MAX_ORDER - 2 is added to the buddy, the free
memory count is incremented if the corresponding bit is not already set,
and the count is then checked against the threshold.

> If so it seems like
> that would get expensive since that is an atomic test and set
> operation that would be hammered until you get below that threshold.

Not sure if I understood "until you get below that threshold". Can you
please explain?
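As a rough illustration of the accounting described above, here is a
minimal user-space sketch. It is not the patch code: struct zone_hint,
on_block_freed(), NBITS, HINT_THRESHOLD and the threshold_hits counter
are all hypothetical names, test_and_set() is a plain non-atomic
stand-in for the kernel's test_and_set_bit(), and the sketch only
records when the threshold check passes rather than calling
queue_work_on(). When the count gets reset is not stated in this thread,
so the sketch never resets it.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NBITS          64   /* hypothetical: tracked blocks per zone */
#define HINT_THRESHOLD 4    /* hypothetical: count that triggers a scan */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-zone state: one bit per MAX_ORDER - 2 sized block. */
struct zone_hint {
    unsigned long bitmap[(NBITS + BITS_PER_WORD - 1) / BITS_PER_WORD];
    unsigned int free_count;      /* bits newly set since tracking began */
    unsigned int threshold_hits;  /* times the threshold check passed */
};

/* Non-atomic stand-in for test_and_set_bit(): returns the old bit value. */
bool test_and_set(unsigned long *map, unsigned int nr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long *word = &map[nr / BITS_PER_WORD];
    bool old = (*word & mask) != 0;

    *word |= mask;
    return old;
}

/* Called when the block with index idx is added back to the buddy. */
void on_block_freed(struct zone_hint *z, unsigned int idx)
{
    if (idx >= NBITS)
        return;                   /* outside the tracked range */
    if (test_and_set(z->bitmap, idx))
        return;                   /* bit already set: count unchanged */
    if (++z->free_count >= HINT_THRESHOLD)
        z->threshold_hits++;      /* where a job would be en-queued */
}
```

In this sketch, freeing the same block twice bumps the count only once,
and the threshold check runs only on a free that newly sets a bit.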
test_and_set_bit() will be called every time a page of order
MAX_ORDER - 2 is added to the buddy (i.e., pages that have not already
been hinted).

-- 
Regards
Nitesh