* makedumpfile memory usage grows with system memory size
@ 2012-03-28 21:22 Don Zickus
  2012-03-29  8:09 ` Ken'ichi Ohmichi
  0 siblings, 1 reply; 34+ messages in thread
From: Don Zickus @ 2012-03-28 21:22 UTC (permalink / raw)
  To: oomichi; +Cc: kexec

Hello Ken'ichi-san,

I was talking to Vivek about kdump memory requirements and he mentioned
that they vary with the amount of system memory in the machine.

I was interested in knowing why that was, and again he mentioned that
makedumpfile needs a lot of memory when running on a large machine
(for example, one with 1TB of system memory).

Looking through the makedumpfile README and using what Vivek remembered of
makedumpfile, we gathered that as the number of pages grows, makedumpfile
has to temporarily store more information in memory.  A possible reason is
that it calculates the size of the file before the file is copied to its
final destination?

I was curious whether that was true and, if it was, whether it would be
possible to process memory in chunks instead of all at once.

The idea is that a machine with 4Gigs of memory should consume the same
amount of kdump runtime memory as a 1TB memory system.

Just trying to research ways to keep the memory requirements consistent
across all system memory sizes.

Thanks,
Don



* Re: makedumpfile memory usage grows with system memory size
  2012-03-28 21:22 makedumpfile memory usage grows with system memory size Don Zickus
@ 2012-03-29  8:09 ` Ken'ichi Ohmichi
  2012-03-29 12:56   ` HATAYAMA Daisuke
                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Ken'ichi Ohmichi @ 2012-03-29  8:09 UTC (permalink / raw)
  To: Don Zickus; +Cc: kexec


Hi Don-san,

On Wed, 28 Mar 2012 17:22:04 -0400
Don Zickus <dzickus@redhat.com> wrote:
> 
> I was talking to Vivek about kdump memory requirements and he mentioned
> that they vary based on how much system memory is used.
> 
> I was interested in knowing why that was and again he mentioned that
> makedumpfile needed lots of memory if it was running on a large machine
> (for example 1TB of system memory).
> 
> Looking through the makedumpfile README and using what Vivek remembered of
> makedumpfile, we gathered that as the number of pages grows, the more
> makedumpfile has to temporarily store the information in memory.  The
> possible reason was to calculate the size of the file before it was copied
> to its final destination?

On RHEL, makedumpfile uses the 2nd kernel's system memory for a bitmap.
The bitmap records whether each page of the 1st kernel is excluded or not,
so the bitmap size depends on the 1st kernel's system memory.

makedumpfile creates the bitmap as a file, /tmp/kdump_bitmapXXXXXX, and
on RHEL that file ends up in the 2nd kernel's memory, because RHEL does
not mount a root filesystem while the 2nd kernel is running.
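
As a rough illustration of the scale (assuming 4KB pages):

    1TB / 4KB per page              = 268,435,456 pages
    268,435,456 pages * 1 bit/page  = 32MB for one bitmap

and a dump uses two bitmaps, so a 1TB machine needs roughly 64MB of
2nd-kernel memory just for this temporary bitmap file.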


> I was curious if that was true and if it was, would it be possible to only
> process memory in chunks instead of all at once.
> 
> The idea is that a machine with 4Gigs of memory should consume the same
> the amount of kdump runtime memory as a 1TB memory system.
> 
> Just trying to research ways to keep the memory requirements consistent
> across all memory ranges.

I think the above goal is good, but I don't have any idea for reducing
the bitmap size. I am no longer involved in makedumpfile development;
Kumagai-san is the makedumpfile maintainer now, and he will help you.


Thanks
Ken'ichi Ohmichi


* Re: makedumpfile memory usage grows with system memory size
  2012-03-29  8:09 ` Ken'ichi Ohmichi
@ 2012-03-29 12:56   ` HATAYAMA Daisuke
  2012-03-29 13:25     ` Don Zickus
  2012-03-29 13:05   ` Don Zickus
  2012-04-02 17:15   ` Michael Holzheu
  2 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-03-29 12:56 UTC (permalink / raw)
  To: dzickus; +Cc: oomichi, kexec

Hello Don,

I somehow missed your mail, so I'm replying to Oomichi-san's mail...

From: "Ken'ichi Ohmichi" <oomichi@mxs.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Thu, 29 Mar 2012 17:09:18 +0900

> 
> On Wed, 28 Mar 2012 17:22:04 -0400
> Don Zickus <dzickus@redhat.com> wrote:

>> I was curious if that was true and if it was, would it be possible to only
>> process memory in chunks instead of all at once.
>> 
>> The idea is that a machine with 4Gigs of memory should consume the same
>> the amount of kdump runtime memory as a 1TB memory system.
>> 
>> Just trying to research ways to keep the memory requirements consistent
>> across all memory ranges.

I think this is possible in constant memory space by creating the bitmaps
and writing the pages for one fixed-size chunk of memory at a time. That
is, if choosing 4GB, process the [0, 4GB) range, then [4GB, 8GB), then
[8GB, 12GB), and so on in order. The key is to restrict the memory range
that the filtering targets.
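
A very rough sketch of that loop, for illustration only (the *_range()
helpers and CHUNK_PFNS are hypothetical; only info->max_mapnr exists in
makedumpfile today):

        #define CHUNK_PFNS (1ULL << 20)   /* 1M pfns = 4GB with 4KB pages */

        unsigned long long start_pfn, end_pfn;

        for (start_pfn = 0; start_pfn < info->max_mapnr; start_pfn = end_pfn) {
                end_pfn = start_pfn + CHUNK_PFNS;
                if (end_pfn > info->max_mapnr)
                        end_pfn = info->max_mapnr;

                /* build the bitmap only for [start_pfn, end_pfn) ... */
                if (!create_dump_bitmap_range(start_pfn, end_pfn))
                        return FALSE;

                /* ... then filter and write out the same range */
                if (!writeout_dumpfile_range(start_pfn, end_pfn))
                        return FALSE;
        }

The memory consumed per cycle then depends only on CHUNK_PFNS, not on the
total system memory size.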

Thanks.
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-03-29  8:09 ` Ken'ichi Ohmichi
  2012-03-29 12:56   ` HATAYAMA Daisuke
@ 2012-03-29 13:05   ` Don Zickus
  2012-03-30  9:43     ` Atsushi Kumagai
  2012-04-02 17:15   ` Michael Holzheu
  2 siblings, 1 reply; 34+ messages in thread
From: Don Zickus @ 2012-03-29 13:05 UTC (permalink / raw)
  To: Ken'ichi Ohmichi; +Cc: kexec

Hi Ken'ichi-san,

On Thu, Mar 29, 2012 at 05:09:18PM +0900, Ken'ichi Ohmichi wrote:
> 
> Hi Don-san,
> 
> On Wed, 28 Mar 2012 17:22:04 -0400
> Don Zickus <dzickus@redhat.com> wrote:
> > 
> > I was talking to Vivek about kdump memory requirements and he mentioned
> > that they vary based on how much system memory is used.
> > 
> > I was interested in knowing why that was and again he mentioned that
> > makedumpfile needed lots of memory if it was running on a large machine
> > (for example 1TB of system memory).
> > 
> > Looking through the makedumpfile README and using what Vivek remembered of
> > makedumpfile, we gathered that as the number of pages grows, the more
> > makedumpfile has to temporarily store the information in memory.  The
> > possible reason was to calculate the size of the file before it was copied
> > to its final destination?
> 
> makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> The bitmap represents each page of 1st-kernel is excluded or not.
> So the bitmap size depends on 1st-kernel's system memory.
> 
> makedumpfile creates a file /tmp/kdump_bitmapXXXXXX as the bitmap,
> and the file is created on 2nd-kernel's memory if RHEL, because
> RHEL does not mount a root filesystem when 2nd-kernel is running.

Ok.

> 
> 
> > I was curious if that was true and if it was, would it be possible to only
> > process memory in chunks instead of all at once.
> > 
> > The idea is that a machine with 4Gigs of memory should consume the same
> > the amount of kdump runtime memory as a 1TB memory system.
> > 
> > Just trying to research ways to keep the memory requirements consistent
> > across all memory ranges.
> 
> I think the above purpose is good, and I don't have any idea for reducing
> the bitmap size. And now I am out of makedumpfile development.
> Kumagai-san is the makedumpfile maintainer now, and he will help you.

Thanks for the feedback, I'll wait for Kumagai-san's response then.

Cheers,
Don


* Re: makedumpfile memory usage grows with system memory size
  2012-03-29 12:56   ` HATAYAMA Daisuke
@ 2012-03-29 13:25     ` Don Zickus
  2012-03-30  0:51       ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Don Zickus @ 2012-03-29 13:25 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: oomichi, kexec

Hello Daisuke,

On Thu, Mar 29, 2012 at 09:56:46PM +0900, HATAYAMA Daisuke wrote:
> Hello Don,
> 
> I'm missing your mail somehow so replying Oomichi-san's mail...
> 
> From: "Ken'ichi Ohmichi" <oomichi@mxs.nes.nec.co.jp>
> Subject: Re: makedumpfile memory usage grows with system memory size
> Date: Thu, 29 Mar 2012 17:09:18 +0900
> 
> > 
> > On Wed, 28 Mar 2012 17:22:04 -0400
> > Don Zickus <dzickus@redhat.com> wrote:
> 
> >> I was curious if that was true and if it was, would it be possible to only
> >> process memory in chunks instead of all at once.
> >> 
> >> The idea is that a machine with 4Gigs of memory should consume the same
> >> the amount of kdump runtime memory as a 1TB memory system.
> >> 
> >> Just trying to research ways to keep the memory requirements consistent
> >> across all memory ranges.
> 
> I think this is possible in constant memory space by creating bitmaps
> and writing pages in a certain amount of memory. That is, if choosing
> 4GB, do [0, 4GB) space processing, [4GB, 8GB) space processing, [8GB,
> 12GB) ... in order. The key is to restrict the target memory range of
> filtering.

Yes, that was what I was thinking.  I am glad to hear that it is possible.
Is there some place in the code where I could help try out that idea?  I
would also be curious whether there is a 'time' impact on how long it takes
to process this (for example, would it add a couple of milliseconds of
overhead, or seconds of overhead?).

Thanks,
Don


* Re: makedumpfile memory usage grows with system memory size
  2012-03-29 13:25     ` Don Zickus
@ 2012-03-30  0:51       ` HATAYAMA Daisuke
  2012-04-02  7:46         ` Atsushi Kumagai
  0 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-03-30  0:51 UTC (permalink / raw)
  To: dzickus; +Cc: oomichi, kexec

From: Don Zickus <dzickus@redhat.com>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Thu, 29 Mar 2012 09:25:33 -0400

> On Thu, Mar 29, 2012 at 09:56:46PM +0900, HATAYAMA Daisuke wrote:

>> From: "Ken'ichi Ohmichi" <oomichi@mxs.nes.nec.co.jp>
>> Subject: Re: makedumpfile memory usage grows with system memory size
>> Date: Thu, 29 Mar 2012 17:09:18 +0900

>> > On Wed, 28 Mar 2012 17:22:04 -0400
>> > Don Zickus <dzickus@redhat.com> wrote:

>> >> I was curious if that was true and if it was, would it be possible to only
>> >> process memory in chunks instead of all at once.
>> >> 
>> >> The idea is that a machine with 4Gigs of memory should consume the same
>> >> the amount of kdump runtime memory as a 1TB memory system.
>> >> 
>> >> Just trying to research ways to keep the memory requirements consistent
>> >> across all memory ranges.

>> I think this is possible in constant memory space by creating bitmaps
>> and writing pages in a certain amount of memory. That is, if choosing
>> 4GB, do [0, 4GB) space processing, [4GB, 8GB) space processing, [8GB,
>> 12GB) ... in order. The key is to restrict the target memory range of
>> filtering.

> Yes, that was what I was thinking.  I am glad to hear that is possible.
> Is there some place in the code that I can help try out that idea?  I
> would also be curious if there is a 'time' impact on how long it takes to
> process this (for example, would it add a couple of milliseconds overhead
> or seconds overhead).

The related part is this path in create_dumpfile():

        if (!create_dump_bitmap())
                return FALSE;

        if (info->flag_split) {
                if ((status = writeout_multiple_dumpfiles()) == FALSE)
                        return FALSE;
        } else {
                if ((status = writeout_dumpfile()) == FALSE)
                        return FALSE;
        }

Right now this part does the task for the whole memory at once. So first it
needs to be generalized so that it does the processing per range of memory.
If the processing were done in three cycles, it would look as follows,
pictorially, in the kdump-compressed format.
                                                
    +------------------------------------------+
    |    main header (struct disk_dump_header) |
    |------------------------------------------+
    |    sub header (struct kdump_sub_header)  |
    |------------------------------------------+
    |                                          | <-- 1st cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |            1st-bitmap                    | <-- 2nd cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |                                          | <-- 3rd cycle
    |------------------------------------------+
    |                                          | <-- 1st cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |            2nd-bitmap                    | <-- 2nd cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |                                          | <-- 3rd cycle
    |------------------------------------------+
    |                                          | <-- 1st cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |            page header                   | <-- 2nd cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |                                          | <-- 3rd cycle
    |------------------------------------------|
    |                                          |
    |            page data                     | <-- 1st cycle
    |                                          |
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |            page data                     | <-- 2nd cycle
      -  -  -  -  -  -  -  -  -  -  -  -  -  -
    |                                          |
    |                                          |
    |            page data                     | <-- 3rd cycle
    |                                          |
    |                                          |
    +------------------------------------------+

where the portions other than page data have fixed length, so I drew only
the page data differently.

For writing pages per range of memory, it's useful to reuse the code for
the --split feature, which splits a single dumpfile into multiple dumpfiles
and already has a data structure holding the start and end page frame
numbers of the corresponding dumped memory. For example, see the part below
in write_kdump_pages().

        if (info->flag_split) {
                start_pfn = info->split_start_pfn;
                end_pfn   = info->split_end_pfn;
        }
        else {
                start_pfn = 0;
                end_pfn   = info->max_mapnr;
        }

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {

For creating and referencing bitmaps per range of memory, there are no
functions that do that; the existing ones, create_bitmap() and
is_dumpable(), work on the whole memory only. Also, creating the bitmap
depends on the source dumpfile format. Trying the ELF to kdump-compressed
format case first seems most handy (or, if the use case is on the 2nd
kernel only, maybe that case alone is enough?).
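
As a sketch of what a range-aware lookup could look like (hypothetical
code; the "cycle" structure and its fields are made up for illustration):

        struct cycle {
                unsigned long long start_pfn;  /* first pfn of this cycle    */
                unsigned long long end_pfn;    /* one past the last pfn      */
                unsigned char *buf;            /* bitmap for this range only */
        };

        static int
        is_dumpable_cycle(struct cycle *cycle, unsigned long long pfn)
        {
                unsigned long long off;

                if (pfn < cycle->start_pfn || pfn >= cycle->end_pfn)
                        return FALSE;          /* outside the current cycle */

                off = pfn - cycle->start_pfn;  /* index relative to the cycle */
                return (cycle->buf[off >> 3] >> (off & 7)) & 1;
        }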

As for the performance impact, I don't know exactly. But I guess iterating
the filtering processing is the most significant cost. I don't know the
exact data structure for each kind of memory, but if there are ones that
need a linear-order walk to look up the data for a given page frame number,
it would be necessary to add some special handling so as not to hurt
performance.

Thanks.
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-03-29 13:05   ` Don Zickus
@ 2012-03-30  9:43     ` Atsushi Kumagai
  2012-03-30 13:19       ` Don Zickus
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-03-30  9:43 UTC (permalink / raw)
  To: dzickus; +Cc: oomichi, kexec

Hello Don,
 
> > On Thu, 29 Mar 2012 09:05:14 -0400
> > Don Zickus <dzickus@redhat.com> wrote:
> > 
> > > Hi Ken'ichi-san,
> > > 
> > > On Thu, Mar 29, 2012 at 05:09:18PM +0900, Ken'ichi Ohmichi wrote:
> > > > 
> > > > Hi Don-san,
> > > > 
> > > > On Wed, 28 Mar 2012 17:22:04 -0400
> > > > Don Zickus <dzickus@redhat.com> wrote:
> > > > > 
> > > > > I was talking to Vivek about kdump memory requirements and he mentioned
> > > > > that they vary based on how much system memory is used.
> > > > > 
> > > > > I was interested in knowing why that was and again he mentioned that
> > > > > makedumpfile needed lots of memory if it was running on a large machine
> > > > > (for example 1TB of system memory).
> > > > > 
> > > > > Looking through the makedumpfile README and using what Vivek remembered of
> > > > > makedumpfile, we gathered that as the number of pages grows, the more
> > > > > makedumpfile has to temporarily store the information in memory.  The
> > > > > possible reason was to calculate the size of the file before it was copied
> > > > > to its final destination?
> > > > 
> > > > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> > > > The bitmap represents each page of 1st-kernel is excluded or not.
> > > > So the bitmap size depends on 1st-kernel's system memory.
> > > > 
> > > > makedumpfile creates a file /tmp/kdump_bitmapXXXXXX as the bitmap,
> > > > and the file is created on 2nd-kernel's memory if RHEL, because
> > > > RHEL does not mount a root filesystem when 2nd-kernel is running.
> > > 
> > > Ok.

Does setting TMPDIR solve your problem? Please refer to the man page.


    ENVIRONMENT VARIABLES
           TMPDIR  This  environment  variable  is  for  a temporary memory bitmap
                   file.  If your machine has a lots of memory and you  use  tmpfs
                   on  /tmp,  makedumpfile can fail for a little memory in the 2nd
                   kernel because makedumpfile makes a very large temporary memory
                   bitmap  file in this case. To avoid this failure, you can set a
                   TMPDIR environment variable. If you do not set a  TMPDIR  envi-
                   ronment variable, makedumpfile uses /tmp directory for a tempo-
                   rary bitmap file as a default.


On the other hand, I'm considering the enhancement suggested by Hatayama-san now.


Thanks
Atsushi Kumagai


* Re: makedumpfile memory usage grows with system memory size
  2012-03-30  9:43     ` Atsushi Kumagai
@ 2012-03-30 13:19       ` Don Zickus
  0 siblings, 0 replies; 34+ messages in thread
From: Don Zickus @ 2012-03-30 13:19 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: oomichi, kexec

On Fri, Mar 30, 2012 at 06:43:34PM +0900, Atsushi Kumagai wrote:
> Hello Don,
> Does setting TMPDIR solve your problem ? Please refer to the man page.
> 
> 
>     ENVIRONMENT VARIABLES
>            TMPDIR  This  environment  variable  is  for  a temporary memory bitmap
>                    file.  If your machine has a lots of memory and you  use  tmpfs
>                    on  /tmp,  makedumpfile can fail for a little memory in the 2nd
>                    kernel because makedumpfile makes a very large temporary memory
>                    bitmap  file in this case. To avoid this failure, you can set a
>                    TMPDIR environment variable. If you do not set a  TMPDIR  envi-
>                    ronment variable, makedumpfile uses /tmp directory for a tempo-
>                    rary bitmap file as a default.

I do not think it will, because we run the second kernel inside the
initramfs and do not mount any extra disks.  So the only location available
for the temporary memory bitmap would be memory, either tmpfs or something
else.  Regardless, the file ends up in memory.

> 
> 
> On the other hand, I'm considering the enhancement suggested by Hatayama-san now.

His idea looks interesting if it works.  Thanks.

Cheers,
Don


* Re: makedumpfile memory usage grows with system memory size
  2012-03-30  0:51       ` HATAYAMA Daisuke
@ 2012-04-02  7:46         ` Atsushi Kumagai
  2012-04-05  6:52           ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-02  7:46 UTC (permalink / raw)
  To: d.hatayama; +Cc: dzickus, oomichi, kexec

Hello Hatayama-san,

On Fri, 30 Mar 2012 09:51:43 +0900
HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
 
> For the processing of writing pages per range of memory, it's useful
> to reuse the code for --split's splitting features that split a single
> dumpfile into a multiple dumpfiles, which has prepared data strucutre
> to have start and end page frame numbers of the corresponding dumped
> memory. For example, see part below in write_kdump_pages().
> 
>         if (info->flag_split) {
>                 start_pfn = info->split_start_pfn;
>                 end_pfn   = info->split_end_pfn;
>         }
>         else {
>                 start_pfn = 0;
>                 end_pfn   = info->max_mapnr;
>         }
> 
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 
> For the processing of creating and referencing bitmaps per range of
> memory, there's no functions that do that. The ones for a whole memory
> only: create_bitmap() and is_dumpable(). Also, creating bitmap depends
> on source dumpfile format. Trying with ELF to kdump-compressed format
> case first seems most handy; or if usecase is on the 2nd kernel only,
> this case is enough?)
> 
> For performance impact, I don't know that exactly. But I guess
> iterating filtering processing is most significant. I don't know exact
> data structure for each kind of memory, but if there's the ones
> needing linear order to look up the data for a given page frame
> number, there would be necessary to add some special handling not to
> reduce performance.

Thank you for your idea.

I think this is an important issue, and I have no idea other than
iterating the filtering process for each memory range.

But as you said, we should consider the performance issue.  For example,
makedumpfile would have to parse the free_list repeatedly to distinguish
whether each pfn is a free page or not, because several ranges may lie
inside the same zone.  That will be overhead.


Thanks
Atsushi Kumagai


* Re: makedumpfile memory usage grows with system memory size
  2012-03-29  8:09 ` Ken'ichi Ohmichi
  2012-03-29 12:56   ` HATAYAMA Daisuke
  2012-03-29 13:05   ` Don Zickus
@ 2012-04-02 17:15   ` Michael Holzheu
  2012-04-06  8:09     ` Atsushi Kumagai
  2 siblings, 1 reply; 34+ messages in thread
From: Michael Holzheu @ 2012-04-02 17:15 UTC (permalink / raw)
  To: Ken'ichi Ohmichi; +Cc: Don Zickus, kexec

Hello Ken'ichi,

On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote:
> On Wed, 28 Mar 2012 17:22:04 -0400
> makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> The bitmap represents each page of 1st-kernel is excluded or not.
> So the bitmap size depends on 1st-kernel's system memory.

Does this mean that makedumpfile's memory demand grows linearly, with 1
bit per page of the 1st kernel's memory?

Is that the exact factor if /tmp is in memory? Or is there any other
memory allocation that is not constant with respect to the 1st kernel's
memory size?

Michael





* Re: makedumpfile memory usage grows with system memory size
  2012-04-02  7:46         ` Atsushi Kumagai
@ 2012-04-05  6:52           ` HATAYAMA Daisuke
  2012-04-05 14:34             ` Vivek Goyal
  0 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-04-05  6:52 UTC (permalink / raw)
  To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Mon, 2 Apr 2012 16:46:51 +0900

> On Fri, 30 Mar 2012 09:51:43 +0900 (   )
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:

>> For performance impact, I don't know that exactly. But I guess
>> iterating filtering processing is most significant. I don't know exact
>> data structure for each kind of memory, but if there's the ones
>> needing linear order to look up the data for a given page frame
>> number, there would be necessary to add some special handling not to
>> reduce performance.

> 
> Thank you for your idea.
> 
> I think this is an important issue and I have no idea except iterating
> filtering processes for each memory range.
> 
> But as you said, we should consider the issue related to performance.
> For example, makedumpfile must parse free_list repeatedly to distinguish
> whether each pfn is a free page or not, because each range may be inside
> the same zone. It will be overhead.
> 

Hello Kumagai-san,

I looked into contents of free_list and confirmed that even buddies
with the same order are not ordered linearly. The below is the output
of makedumpfile I customized so it outputs buddy data.

# ./makedumpfile --message-level 32 -c -d 31 /media/127.0.0.1-2012-04-04-20:31:58/vmcore vmcore-cd31
NR_ZONE: 0
order: 10 migrate_type: 2 pfn: 3072
order: 10 migrate_type: 2 pfn: 2048
order: 10 migrate_type: 2 pfn: 1024
order: 9 migrate_type: 3 pfn: 512
order: 8 migrate_type: 0 pfn: 256
order: 6 migrate_type: 0 pfn: 64
order: 5 migrate_type: 0 pfn: 32
order: 4 migrate_type: 0 pfn: 128
order: 4 migrate_type: 0 pfn: 16
order: 2 migrate_type: 0 pfn: 144
order: 1 migrate_type: 0 pfn: 148
NR_ZONE: 1
order: 10 migrate_type: 2 pfn: 226304
order: 10 migrate_type: 2 pfn: 225280
order: 10 migrate_type: 2 pfn: 486400
order: 10 migrate_type: 2 pfn: 485376
order: 10 migrate_type: 2 pfn: 484352
order: 10 migrate_type: 2 pfn: 483328
order: 10 migrate_type: 2 pfn: 482304
order: 10 migrate_type: 2 pfn: 481280
<snip>

So we cannot take the approach of simply walking the free_list in
increasing pfn order for a given range of memory, suspending the walk, and
saving the state for the next walk...

So it's necessary to create a table that can be accessed in constant time.
But for that, the table needs to be created in memory. On the 2nd kernel we
cannot assume any backing store in general: consider the scp case, for
example.

I think the basic idea would be one of several small-memory programming
efforts, like:

  * Create only the part of the bitmap corresponding to the range of memory
    currently being processed, and repeat the table creation each time a
    new range of memory is started.
    => hard to avoid looking up the whole free_list every time, but this is
    the only idea I have come up with that keeps the consumed memory stably
    constant.

  * Keep the table in a memory-mapping (list of ranges) form rather than a
    bitmap, and switch back to a bitmap if its size gets larger than the
    bitmap's.
    => bad performance in a very fragmented case, and constructing the
    memory mapping requires O(n^2), so it would cost a lot if done multiple
    times.

  * Compress the parts of the bitmap other than the one currently being
    processed.
    => bad performance when compression doesn't work well, or when
    compression is done too many times.

But before that, I also want to consider the possibility of increasing the
reserved memory for the 2nd kernel.

In the discussion of the 512MB reservation regression last month, Vivek
explained that 512MB is the current maximum value and is enough for at most
a 6TB system.

  https://lkml.org/lkml/2012/3/13/372

But on such a machine, where makedumpfile performance is affected, there
seems to be room to reserve more memory than 512MB. Also, Yinghai said,
following Vivek, that system memory sizes will keep growing in the next
years.

Note:
  * 1 bit in the bitmap represents 1 page frame. On x86, 1 byte covers
    32kB of memory, so 1TB of memory requires 32MB. A dump includes two
    bitmaps, so 64MB is needed in total.
  * The bad performance concerns free pages only. Cache, cache-private,
    user and zero pages are processed per range of memory with good
    performance.

Thanks.
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-04-05  6:52           ` HATAYAMA Daisuke
@ 2012-04-05 14:34             ` Vivek Goyal
  2012-04-06  1:12               ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2012-04-05 14:34 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Thu, Apr 05, 2012 at 03:52:11PM +0900, HATAYAMA Daisuke wrote:

[..]
>   * Bad performance is free pages only. Cache, cache private, user and
>     zero pages are processed per range of memory in good performance.

Hi Daisuke-san,

I am wondering why we can't walk through the memmap array and look into
struct page to figure out whether a page is free or not. It looks like in
the past we used to have the PG_buddy flag, and the same information could
possibly be retrieved by looking at the page->_count field.

So I am just curious why we walk through the free pages list to figure
out free pages instead of looking at "struct page".

Thanks
Vivek


* Re: makedumpfile memory usage grows with system memory size
  2012-04-05 14:34             ` Vivek Goyal
@ 2012-04-06  1:12               ` HATAYAMA Daisuke
  2012-04-06  8:59                 ` Atsushi Kumagai
  2012-04-09 19:00                 ` Vivek Goyal
  0 siblings, 2 replies; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-04-06  1:12 UTC (permalink / raw)
  To: vgoyal; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Thu, 5 Apr 2012 10:34:39 -0400

> On Thu, Apr 05, 2012 at 03:52:11PM +0900, HATAYAMA Daisuke wrote:
> 
> [..]
>>   * Bad performance is free pages only. Cache, cache private, user and
>>     zero pages are processed per range of memory in good performance.
> 
> Hi Daisuke-san,
> 

Hello Vivek,

> I am wondering why can't we walk through the memmap array and look into
> struct page for figuring out if page is free or not. Looks like that
> in the past we used to have PG_buddy flag and same information possibly
> could be retrieved by looking at page->_count field. 
> 
> So I am just curious that why do we walk through free pages list to figure
> out free pages instead of looking at "struct page".

Thanks. To be honest, I have only just begun reading around this area and
learned about PG_buddy just now. I did a small check of this on 2.6.18
with the patch at the bottom of this mail, and the free pages found from
the free_list and by the PG_buddy check coincide.

As Vivek says, more recent kernels have changes around PG_buddy, and the
patch below says we should check _mapcount; I have yet to check this.

Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Thu Jan 13 15:47:00 2011 -0800

     thp: remove PG_buddy

    PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
    be added to page->flags without overflowing (because of the sparse section
    bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
    has to move the memory hotplug code from _mapcount to lru.next to avoid
    any risk of clashes.  We can't use lru.next for PG_buddy removal, but
    memory hotplug can use lru.next even more easily than the mapcount
    instead.

    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

$ git describe 5f24ce5fd34c3ca1b3d10d30da754732da64d5c0
v2.6.37-7012-g5f24ce5

So now we can walk the memmap array for free pages too, like for the other
kinds of memory. The question I have now is why the current implementation
was chosen. Is there any difference between the two ways?

Subject: [PATCH] Add free pages message

---
 makedumpfile.c |    9 +++++++++
 makedumpfile.h |    1 +
 print_info.h   |    2 +-
 3 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index c843567..bd770b1 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -3198,6 +3198,9 @@ reset_bitmap_of_free_pages(unsigned long node_zones)
                                        retcd = ANALYSIS_FAILED;
                                        return FALSE;
                                }
+
+                               FREEPAGE_MSG("order: %d migrate_type: %d pfn: %llu\n", order, migrate_type, start_pfn);
+
                                for (i = 0; i < (1<<order); i++) {
                                        pfn = start_pfn + i;
                                        clear_bit_on_2nd_bitmap_for_kernel(pfn);
@@ -3399,6 +3402,7 @@ _exclude_free_page(void)
                        }
                        if (!spanned_pages)
                                continue;
+                       FREEPAGE_MSG("NR_ZONE: %d\n", i);
                        if (!reset_bitmap_of_free_pages(zone))
                                return FALSE;
                }
@@ -3688,6 +3692,11 @@ __exclude_unnecessary_pages(unsigned long mem_map,
                _count  = UINT(pcache + OFFSET(page._count));
                mapping = ULONG(pcache + OFFSET(page.mapping));

+               if ((info->dump_level & DL_EXCLUDE_FREE)
+                   && (flags & (1UL << PG_buddy))) {
+                       FREEPAGE_MSG("PG_buddy: flags: %#016lx pfn %llu\n", flags, pfn);
+               }
+
                /*
                 * Exclude the cache page without the private page.
                 */
diff --git a/makedumpfile.h b/makedumpfile.h
index ed1e9de..1faef47 100644
--- a/makedumpfile.h
+++ b/makedumpfile.h
@@ -67,6 +67,7 @@ int get_mem_type(void);
 #define PG_lru_ORIGINAL                (5)
 #define PG_private_ORIGINAL    (11)    /* Has something at ->private */
 #define PG_swapcache_ORIGINAL  (15)    /* Swap page: swp_entry_t in private */
+#define PG_buddy               (19)

 #define PAGE_MAPPING_ANON      (1)

diff --git a/print_info.h b/print_info.h
index 94968ca..44415d3 100644
--- a/print_info.h
+++ b/print_info.h
@@ -42,7 +42,7 @@ void print_execution_time(char *step_name, struct timeval *tv_start);
  * Message Level
  */
 #define MIN_MSG_LEVEL          (0)
-#define MAX_MSG_LEVEL          (31)
+#define MAX_MSG_LEVEL          (31+0x20)
 #define DEFAULT_MSG_LEVEL      (7)     /* Print the progress indicator, the
                                           common message, the error message */
 #define ML_PRINT_PROGRESS      (0x001) /* Print the progress indicator */
--
1.7.4.4

Thanks,
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-04-02 17:15   ` Michael Holzheu
@ 2012-04-06  8:09     ` Atsushi Kumagai
  2012-04-11  8:04       ` Michael Holzheu
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-06  8:09 UTC (permalink / raw)
  To: holzheu; +Cc: dzickus, oomichi, kexec

Hello Michael,

On Mon, 02 Apr 2012 19:15:33 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Ken'ichi,
> 
> On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote:
> > On Wed, 28 Mar 2012 17:22:04 -0400
> > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> > The bitmap represents each page of 1st-kernel is excluded or not.
> > So the bitmap size depends on 1st-kernel's system memory.
> 
> Does this mean that makedumpfile's memory demand linearly grows with 1
> bit per page of 1-st kernel's memory?

Yes, you are right. (Precisely, 2 bits per page.)

> Is that the exact factor, if /tmp is in memory? Or is there any other
> memory allocation that is not constant regarding the 1-st kernel memory
> size?

The bitmap file is the main cause of memory consumption if the 2nd kernel
uses an initramfs only. There are other places where the size of allocated
memory varies with the 1st kernel's memory size, but they don't have a big
influence.


Thanks
Atsushi Kumagai


* Re: makedumpfile memory usage grows with system memory size
  2012-04-06  1:12               ` HATAYAMA Daisuke
@ 2012-04-06  8:59                 ` Atsushi Kumagai
  2012-04-06  9:29                   ` HATAYAMA Daisuke
  2012-04-09 19:00                 ` Vivek Goyal
  1 sibling, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-06  8:59 UTC (permalink / raw)
  To: d.hatayama; +Cc: dzickus, oomichi, kexec, vgoyal

Hello Hatayama-san,

On Fri, 06 Apr 2012 10:12:12 +0900
HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:

> Thanks. To be honest, I have just beginning with reading around here
> and known PG_buddy just now. I have small checked this fact on 2.6.18
> with the patch in the bottom of this mail and free pages found from
> free_list and by PG_buddy check are coincide.
> 
> As Vivek says, more recent kernel has change around PG_buddy and the
> patch says we should check _mapcount; I have yet to check this.
> 
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Thu Jan 13 15:47:00 2011 -0800
> 
>      thp: remove PG_buddy
> 
>     PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
>     be added to page->flags without overflowing (because of the sparse section
>     bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
>     has to move the memory hotplug code from _mapcount to lru.next to avoid
>     any risk of clashes.  We can't use lru.next for PG_buddy removal, but
>     memory hotplug can use lru.next even more easily than the mapcount
>     instead.
> 
>     Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> $ git describe 5f24ce5fd34c3ca1b3d10d30da754732da64d5c0
> v2.6.37-7012-g5f24ce5
> 
> So now we can walk on the memmap array also for free pages like other
> kinds of memory. The question I have now is why the current
> implementation was chosen. Is there any difference between two ways?

We just referred to the implementation of disk_dump.

Now, I'm checking the validity of using the _count field to figure out
free pages. I would like to use _count rather than PG_buddy because I want
to avoid changing behavior based on kernel versions as much as possible.
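
For what it's worth, a minimal sketch of the _count-based variant, modelled
on the __exclude_unnecessary_pages() code in Hatayama-san's patch (the
"_count == 0" test is exactly the assumption to be validated, not verified
behaviour):

                _count  = UINT(pcache + OFFSET(page._count));

                /*
                 * Assumption under test: a page sitting in the buddy free
                 * lists has no users left, so its reference count is 0.
                 */
                if ((info->dump_level & DL_EXCLUDE_FREE) && (_count == 0)) {
                        clear_bit_on_2nd_bitmap_for_kernel(pfn);
                        continue;
                }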


Thanks
Atsushi Kumagai


* Re: makedumpfile memory usage grows with system memory size
  2012-04-06  8:59                 ` Atsushi Kumagai
@ 2012-04-06  9:29                   ` HATAYAMA Daisuke
  2012-04-09 18:57                     ` Vivek Goyal
  0 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-04-06  9:29 UTC (permalink / raw)
  To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Fri, 6 Apr 2012 17:59:24 +0900

> Hello Hatayama-san,
> 
> On Fri, 06 Apr 2012 10:12:12 +0900 (   )
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> 
>> Thanks. To be honest, I have just beginning with reading around here
>> and known PG_buddy just now. I have small checked this fact on 2.6.18
>> with the patch in the bottom of this mail and free pages found from
>> free_list and by PG_buddy check are coincide.
>> 
>> As Vivek says, more recent kernel has change around PG_buddy and the
>> patch says we should check _mapcount; I have yet to check this.
>> 
>> Author: Andrea Arcangeli <aarcange@redhat.com>
>> Date:   Thu Jan 13 15:47:00 2011 -0800
>> 
>>      thp: remove PG_buddy
>> 
>>     PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
>>     be added to page->flags without overflowing (because of the sparse section
>>     bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
>>     has to move the memory hotplug code from _mapcount to lru.next to avoid
>>     any risk of clashes.  We can't use lru.next for PG_buddy removal, but
>>     memory hotplug can use lru.next even more easily than the mapcount
>>     instead.
>> 
>>     Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> 
>> $ git describe 5f24ce5fd34c3ca1b3d10d30da754732da64d5c0
>> v2.6.37-7012-g5f24ce5
>> 
>> So now we can walk on the memmap array also for free pages like other
>> kinds of memory. The question I have now is why the current
>> implementation was chosen. Is there any difference between two ways?
> 
> We just referred to the implementation of disk_dump.
> 
> Now, I'm checking the validity of using _count field to figure out free pages.
> I would like to use _count rather than PG_buddy because I would like to avoid
> changing behavior based on versions as long as possible.
> 

I agree. On the other hand, there is one more thing to consider. The order
value is kept in the private member of the page descriptor, and there is
currently no information about the private member in VMCOREINFO. If we
choose this method and delete the current one, it will be necessary to
provide a vmlinux file for old kernels.

Thanks.
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-04-06  9:29                   ` HATAYAMA Daisuke
@ 2012-04-09 18:57                     ` Vivek Goyal
  2012-04-09 23:58                       ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2012-04-09 18:57 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:

[..]
> I agree. On the other hand, there is one more thing to consider. The
> value of order is in private member of the page descripter. Now
> there's no information for private member in VMCOREINFO. If we choose
> this method and delete the current one, it's necessary to prepare
> vmlinux file for old kernels.

What information do you need to access the "private" member of "struct
page"? The offset? Can't we extend VMCOREINFO to export this info too?

Thanks
Vivek


* Re: makedumpfile memory usage grows with system memory size
  2012-04-06  1:12               ` HATAYAMA Daisuke
  2012-04-06  8:59                 ` Atsushi Kumagai
@ 2012-04-09 19:00                 ` Vivek Goyal
  1 sibling, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2012-04-09 19:00 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Fri, Apr 06, 2012 at 10:12:12AM +0900, HATAYAMA Daisuke wrote:

[..]
> So now we can walk on the memmap array also for free pages like other
> kinds of memory. The question I have now is why the current
> implementation was chosen. Is there any difference between two ways?

I don't know, but I am guessing that going through the buddy allocator
data structures is faster when there is tons of memory in the system and
the number of free pages is small.

Thanks
Vivek


* Re: makedumpfile memory usage grows with system memory size
  2012-04-09 18:57                     ` Vivek Goyal
@ 2012-04-09 23:58                       ` HATAYAMA Daisuke
  2012-04-10 12:52                         ` Vivek Goyal
  0 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-04-09 23:58 UTC (permalink / raw)
  To: vgoyal; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Mon, 9 Apr 2012 14:57:28 -0400

> On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:
> 
> [..]
>> I agree. On the other hand, there is one more thing to consider. The
>> value of order is in private member of the page descripter. Now
>> there's no information for private member in VMCOREINFO. If we choose
>> this method and delete the current one, it's necessary to prepare
>> vmlinux file for old kernels.
> 
> What information do you need to access "private" member of "struct page".
> offset? Can't we extend VMCOREINFO to export this info too?
> 

Yes, I mean the offset of the private member in the page structure; that
member contains the order of the buddy. Extending VMCOREINFO is easy, but
we cannot do that for old kernels, for which a vmlinux would be needed
separately.

This might be the same issue Kumagai-san raised when he said he doesn't
want to change behaviour based on kernel versions.

Thanks.
HATAYAMA, Daisuke



* Re: makedumpfile memory usage grows with system memory size
  2012-04-09 23:58                       ` HATAYAMA Daisuke
@ 2012-04-10 12:52                         ` Vivek Goyal
  2012-04-12  3:40                           ` Atsushi Kumagai
  0 siblings, 1 reply; 34+ messages in thread
From: Vivek Goyal @ 2012-04-10 12:52 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: dzickus, oomichi, kumagai-atsushi, kexec

On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote:
> From: Vivek Goyal <vgoyal@redhat.com>
> Subject: Re: makedumpfile memory usage grows with system memory size
> Date: Mon, 9 Apr 2012 14:57:28 -0400
> 
> > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:
> > 
> > [..]
> >> I agree. On the other hand, there is one more thing to consider. The
> >> value of order is in private member of the page descripter. Now
> >> there's no information for private member in VMCOREINFO. If we choose
> >> this method and delete the current one, it's necessary to prepare
> >> vmlinux file for old kernels.
> > 
> > What information do you need to access "private" member of "struct page".
> > offset? Can't we extend VMCOREINFO to export this info too?
> > 
> 
> Yes, I mean offset of private member in page structure. The member
> contains order of the buddy. Extending VMCOREINFO is easy, but we
> cannot do that for old kernels, for which vmlinux is needed
> separately.
> 
> This might be the same as what Kumagai-san says he doesn' want to
> change behaviour on kernel versions.

We can retain both mechanisms. For newer kernels which export the
page->private offset, we can walk through the memmap array, prepare a chunk
of the bitmap, and discard it. For older kernels we can continue to walk
through the free pages list and prepare the big bitmap in userspace.

It is desirable to keep the mechanism the same across kernel versions, but
change is unavoidable as things evolve in newer kernels. So at most we can
provide backward compatibility with old kernels.
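
As a sketch of that dispatch (the helper names and the availability test
are hypothetical):

        if (page_private_offset_is_exported()) {
                /* newer kernels: walk the memmap in chunks and use
                 * page->private / _count to detect free pages */
                exclude_free_pages_via_memmap();
        } else {
                /* older kernels: keep walking the buddy free_list and
                 * build the full bitmap as today */
                exclude_free_pages_via_free_list();
        }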

Thanks
Vivek


* Re: makedumpfile memory usage grows with system memory size
  2012-04-06  8:09     ` Atsushi Kumagai
@ 2012-04-11  8:04       ` Michael Holzheu
  2012-04-12  8:49         ` Atsushi Kumagai
  0 siblings, 1 reply; 34+ messages in thread
From: Michael Holzheu @ 2012-04-11  8:04 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: dzickus, oomichi, kexec

Hello Kumagai,

On Fri, 2012-04-06 at 17:09 +0900, Atsushi Kumagai wrote:
> Hello Michael,
> 
> On Mon, 02 Apr 2012 19:15:33 +0200
> Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> 
> > Hello Ken'ichi,
> > 
> > On Thu, 2012-03-29 at 17:09 +0900, Ken'ichi Ohmichi wrote:
> > > On Wed, 28 Mar 2012 17:22:04 -0400
> > > makedumpfile uses the system memory of 2nd-kernel for a bitmap if RHEL.
> > > The bitmap represents each page of 1st-kernel is excluded or not.
> > > So the bitmap size depends on 1st-kernel's system memory.
> > 
> > Does this mean that makedumpfile's memory demand linearly grows with 1
> > bit per page of 1-st kernel's memory?
> 
> Yes, you are right. (Precisely, 2 bit per page.)
> 
> > Is that the exact factor, if /tmp is in memory? Or is there any other
> > memory allocation that is not constant regarding the 1-st kernel memory
> > size?
> 
> bitmap file is main cause of memory consuming if 2nd kernel uses initramfs
> only. There are other parts where the size of allocated memory varies based
> on 1-st kernel memory size, but they don't have big influence.

Thanks for the explanation. 

I ask because I want to calculate the required size for the crashkernel
parameter exactly. On s390 the kdump kernel memory consumption is fixed and
does not depend on the 1st kernel memory size. So based on your
explanation, I will use:

crashkernel=<base size> + <variable size>

where

<variable size> = <pages of 1st kernel> * (2 + x) / 8

where "x" is the variable makedumpfile memory allocation that is on top
of the bitmap allocation. What would be a good value for "x"?

Michael



* Re: makedumpfile memory usage grows with system memory size
  2012-04-10 12:52                         ` Vivek Goyal
@ 2012-04-12  3:40                           ` Atsushi Kumagai
  2012-04-12  7:47                             ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-12  3:40 UTC (permalink / raw)
  To: vgoyal; +Cc: dzickus, oomichi, d.hatayama, kexec

Hello,

On Tue, 10 Apr 2012 08:52:05 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote:
> > From: Vivek Goyal <vgoyal@redhat.com>
> > Subject: Re: makedumpfile memory usage grows with system memory size
> > Date: Mon, 9 Apr 2012 14:57:28 -0400
> > 
> > > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:
> > > 
> > > [..]
> > >> I agree. On the other hand, there is one more thing to consider. The
> > >> value of order is in private member of the page descripter. Now
> > >> there's no information for private member in VMCOREINFO. If we choose
> > >> this method and delete the current one, it's necessary to prepare
> > >> vmlinux file for old kernels.
> > > 
> > > What information do you need to access "private" member of "struct page".
> > > offset? Can't we extend VMCOREINFO to export this info too?
> > > 
> > 
> > Yes, I mean offset of private member in page structure. The member
> > contains order of the buddy. Extending VMCOREINFO is easy, but we
> > cannot do that for old kernels, for which vmlinux is needed
> > separately.
> > 
> > This might be the same as what Kumagai-san says he doesn' want to
> > change behaviour on kernel versions.
> 
> We can retain both the mechanisms. For newer kernels which export
> page->private offset, we can walk through memmap array and prepare a
> chunk of bitmap and discard it. For older kernels we can continue to walk
> through free pages list and prepare big bitmap in userspace.
> 
> It is desirable to keep mechanism same across kernel versions, but
> change is unavoidable as things evolve in newer kernels. So at max
> we can provide backward compatibility with old kernels.

I said I want to avoid changing behavior based on kernel versions, but
that seems difficult, as Vivek said. So I will accept the change if it is
necessary.

Now, I will make two prototypes to evaluate the method for figuring out
free pages:

  - a prototype based on _count
  - a prototype based on PG_buddy (or _mapcount)

If the prototypes work fine, then we can select the better method.


Thanks
Atsushi Kumagai


* Re: makedumpfile memory usage grows with system memory size
  2012-04-12  3:40                           ` Atsushi Kumagai
@ 2012-04-12  7:47                             ` HATAYAMA Daisuke
       [not found]                               ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
  0 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-04-12  7:47 UTC (permalink / raw)
  To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Thu, 12 Apr 2012 12:40:40 +0900

> Hello,
> 
> On Tue, 10 Apr 2012 08:52:05 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
>> On Tue, Apr 10, 2012 at 08:58:24AM +0900, HATAYAMA Daisuke wrote:
>> > From: Vivek Goyal <vgoyal@redhat.com>
>> > Subject: Re: makedumpfile memory usage grows with system memory size
>> > Date: Mon, 9 Apr 2012 14:57:28 -0400
>> > 
>> > > On Fri, Apr 06, 2012 at 06:29:40PM +0900, HATAYAMA Daisuke wrote:
>> > > 
>> > > [..]
>> > >> I agree. On the other hand, there is one more thing to consider. The
>> > >> value of order is in private member of the page descripter. Now
>> > >> there's no information for private member in VMCOREINFO. If we choose
>> > >> this method and delete the current one, it's necessary to prepare
>> > >> vmlinux file for old kernels.
>> > > 
>> > > What information do you need to access "private" member of "struct page".
>> > > offset? Can't we extend VMCOREINFO to export this info too?
>> > > 
>> > 
>> > Yes, I mean offset of private member in page structure. The member
>> > contains order of the buddy. Extending VMCOREINFO is easy, but we
>> > cannot do that for old kernels, for which vmlinux is needed
>> > separately.
>> > 
>> > This might be the same as what Kumagai-san says he doesn' want to
>> > change behaviour on kernel versions.
>> 
>> We can retain both the mechanisms. For newer kernels which export
>> page->private offset, we can walk through memmap array and prepare a
>> chunk of bitmap and discard it. For older kernels we can continue to walk
>> through free pages list and prepare big bitmap in userspace.
>> 
>> It is desirable to keep mechanism same across kernel versions, but
>> change is unavoidable as things evolve in newer kernels. So at max
>> we can provide backward compatibility with old kernels.
> 
> I said I want to avoid changing behavior based on kernel versions, 
> but it seems difficult as Vivek said. So, I will accept the changing
> if it is necessary.
> 
> Now, I will make two prototypes to consider the method to figure out
> free pages.
> 
>   - a prototype based on _count
>   - a prototype based on PG_buddy (or _mapcount)
>   
> If prototypes work fine, then we can select the method.

I think the first one would work well, and it is more accurate in
the sense of what counts as a free page.

Although this might not be problematic in practice, the new method that
walks the page descriptors can lead to a different result from the previous
one that looks up the free_list: looking at __free_pages(), it first
decreases page->_count and then adds the page to the free_list, and looking
at __alloc_pages(), it first retrieves a page from the free_list and then
sets page->_count to 1.
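
As a minimal sketch of that idea (not actual makedumpfile code; the helper
names below are hypothetical), the _count-based prototype would conceptually
look like this:

/* Hypothetical helpers: read page->_count from the vmcore for a pfn,
 * and clear the pfn's bit in the dump bitmap. */
extern int read_page_count(unsigned long pfn);
extern void exclude_page(unsigned long pfn);

/*
 * Hedged sketch only: walk the page descriptors linearly and treat a
 * page whose reference count is zero as free.
 */
static void exclude_free_pages_by_count(unsigned long start_pfn,
                                        unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /*
                 * Caveat described above: __free_pages() drops _count
                 * before linking the page into the free_list, so a crash
                 * in that window can make this test disagree with a
                 * free_list walk.
                 */
                if (read_page_count(pfn) == 0)
                        exclude_page(pfn);      /* drop the page from the dump */
        }
}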

Thanks.
HATAYAMA, Daisuke


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-04-11  8:04       ` Michael Holzheu
@ 2012-04-12  8:49         ` Atsushi Kumagai
  0 siblings, 0 replies; 34+ messages in thread
From: Atsushi Kumagai @ 2012-04-12  8:49 UTC (permalink / raw)
  To: holzheu; +Cc: dzickus, oomichi, kexec

Hello Michael,

On Wed, 11 Apr 2012 10:04:03 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> > bitmap file is main cause of memory consuming if 2nd kernel uses initramfs
> > only. There are other parts where the size of allocated memory varies based
> > on 1-st kernel memory size, but they don't have big influence.
> 
> Thanks for the explanation. 
> 
> I ask because I want to exactly calculate the required size for the
> crashkernel parameter. On s390 the kdump kernel memory consumption is
> fix and not dependent on the 1st kernel memory size. So based on your
> explanation I, will use:
> 
> crashkernel=<base size> + <variable size>
> 
> where
> 
> <variable size> = <pages of 1st kernel> * (2 + x) / 8
> 
> where "x" is the variable makedumpfile memory allocation that is on top
> of the bitmap allocation. What would be a good value for "x"?

I'm sorry that I don't have the exact number, but even the second-largest
memory allocation would require under 1/100 of the bitmap size, so I think
0.01 is usually enough for "x".
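
As a rough worked example of that formula (assuming 4 KiB pages and x = 0.01;
the numbers are illustrative only, not measured values, and the 2 presumably
stands for the two one-bit-per-page bitmaps):

/* Hedged sketch: the crashkernel "variable size" for 1 TiB of
 * 1st-kernel memory, i.e. pages * (2 + x) / 8 bytes. */
#include <stdio.h>

int main(void)
{
        unsigned long long mem_bytes = 1ULL << 40;        /* 1 TiB */
        unsigned long long pages     = mem_bytes / 4096;  /* 268,435,456 pages */
        double x = 0.01;                                  /* other allocations */
        double variable_bytes = pages * (2.0 + x) / 8.0;

        printf("%.1f MiB\n", variable_bytes / (1024 * 1024));
        /* prints about 64.3 MiB */
        return 0;
}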

Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
       [not found]                               ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
@ 2012-04-27 12:52                                 ` Don Zickus
  2012-05-11  1:19                                   ` Atsushi Kumagai
  2012-04-27 13:33                                 ` Vivek Goyal
  2012-05-14  5:44                                 ` HATAYAMA Daisuke
  2 siblings, 1 reply; 34+ messages in thread
From: Don Zickus @ 2012-04-27 12:52 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal

On Fri, Apr 27, 2012 at 04:46:49PM +0900, Atsushi Kumagai wrote:
> Hello,
> 
> On Thu, 12 Apr 2012 16:47:14 +0900 (JST)
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> 
> [..]
> > > I said I want to avoid changing behavior based on kernel versions, 
> > > but it seems difficult as Vivek said. So, I will accept the changing
> > > if it is necessary.
> > > 
> > > Now, I will make two prototypes to consider the method to figure out
> > > free pages.
> > > 
> > >   - a prototype based on _count
> > >   - a prototype based on PG_buddy (or _mapcount)
> > >   
> > > If prototypes work fine, then we can select the method.
> > 
> > I think the first one would work well and it's more accurate in
> > meaning of free page.
> > 
> > Although this might be not problematic in practice, new method that
> > walks on page tables can lead to different result from the previous
> > one that looks up free_list: looking at __free_pages(), it first
> > decreases page->_count and then add the page to free_list, and looking
> > at __alloc_pages(), it first retrieves a page from free_list and then
> > set page->_count to 1.
> 
> I tested the prototype based on _count and the other based on _mapcount.
> So, the former didn't work as expected while the latter worked fine.
> (The former excluded some used pages as free pages.)
> 
> As a next step, I measured performance of the prototype based on _mapcount,
> please see below.

Thanks for this work.  I assume this work just switches how free pages are
referenced and does not yet attempt to cut down on memory usage (I guess
that would be the next step if using _mapcount is acceptable)?

> 
> 
> Performance Comparison:
> 
>   Explanation:
>     - The new method supports 2.6.39 and later, and it needs vmlinux.
> 
>     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
>       is different depending on kernel configuration and it isn't stored into 
>       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
>       when the value of PG_buddy is stored into VMCOREINFO.
> 
>     - The prototype has dump_level "32" to use new method, but I don't think
>       to extend dump_level for official version.
> 
>   How to measure:
>     I measured execution times with vmcore of 5GB in below cases with 
>     attached patches.
> 
>       - dump_level 16: exclude only free pages with the current method
>       - dump_level 31: exclude all excludable pages with the current method
>       - dump_level 32: exclude only free pages with the new method
>       - dump_level 47: exclude all excludable pages with the new method
> 
>   Result:
>      ------------------------------------------------------------------------
>      dump_level	     size [Bytes]    total time	   d_all_time     d_new_time	
>      ------------------------------------------------------------------------
>      	16		431864384	28.6s	     4.19s	      0s
>      	31		111808568	14.5s	      0.9s	      0s
>      	32		431864384	41.2s	     16.8s	   0.05s
>      	47		111808568	31.5s	     16.6s	   0.05s
>      ------------------------------------------------------------------------
> 
>   Discussion:
>     I think the new method can be used instead of the current method in many cases.
>     (However, the result of dump_level 31 looks too fast, I'm researching why
>     the case can execute so fast.)
> 
>     I would like to get your opinion.

I am curious.  Looking through your patches, it seems the increase in
d_all_time should come from the new method, because the if-statement is
set up to accept only the new method.  Therefore I was expecting that
d_new_time for the new method, added to d_all_time for the current method,
would come close to d_all_time for the new method.  IOW I would have expected
the extra 10-12 seconds from the new method to show up in d_new_time.

However, I do not see that.  d_new_time hardly increases at all.  So what
is accounting for the increase in d_all_time for the new method?

Thanks,
Don

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
       [not found]                               ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
  2012-04-27 12:52                                 ` Don Zickus
@ 2012-04-27 13:33                                 ` Vivek Goyal
  2012-05-14  5:44                                 ` HATAYAMA Daisuke
  2 siblings, 0 replies; 34+ messages in thread
From: Vivek Goyal @ 2012-04-27 13:33 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: dzickus, oomichi, d.hatayama, kexec

On Fri, Apr 27, 2012 at 04:46:49PM +0900, Atsushi Kumagai wrote:
> Hello,
> 
> On Thu, 12 Apr 2012 16:47:14 +0900 (JST)
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> 
> [..]
> > > I said I want to avoid changing behavior based on kernel versions, 
> > > but it seems difficult as Vivek said. So, I will accept the changing
> > > if it is necessary.
> > > 
> > > Now, I will make two prototypes to consider the method to figure out
> > > free pages.
> > > 
> > >   - a prototype based on _count
> > >   - a prototype based on PG_buddy (or _mapcount)
> > >   
> > > If prototypes work fine, then we can select the method.
> > 
> > I think the first one would work well and it's more accurate in
> > meaning of free page.
> > 
> > Although this might be not problematic in practice, new method that
> > walks on page tables can lead to different result from the previous
> > one that looks up free_list: looking at __free_pages(), it first
> > decreases page->_count and then add the page to free_list, and looking
> > at __alloc_pages(), it first retrieves a page from free_list and then
> > set page->_count to 1.
> 
> I tested the prototype based on _count and the other based on _mapcount.
> So, the former didn't work as expected while the latter worked fine.
> (The former excluded some used pages as free pages.)
> 
> As a next step, I measured performance of the prototype based on _mapcount,
> please see below.
> 
> 
> Performance Comparison:
> 
>   Explanation:
>     - The new method supports 2.6.39 and later, and it needs vmlinux.
> 
>     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
>       is different depending on kernel configuration and it isn't stored into 
>       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
>       when the value of PG_buddy is stored into VMCOREINFO.
> 
>     - The prototype has dump_level "32" to use new method, but I don't think
>       to extend dump_level for official version.

Thanks for your work. Yes, introducing new dump_level for new filtering
method will not be appropriate. If it is found that going through struct
pages and parsing _mapcount is not too bad from performance point of view,
then makedumpfile should just switch its default on newer kernels. 

Or, I am assuming that anyway we will intorduce a new option to
makedumpfile to tell whether we want to a fixed memory usage filtering
or not (assuming there is significant performance penalty on large
machines, 1TB or more). So with that option we can do free page filtering
using struct page otherwise we can continue to go through free pages
list.

Anyway, I think it is too early to discuss various user visible options.

Thanks
Vivek

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-04-27 12:52                                 ` Don Zickus
@ 2012-05-11  1:19                                   ` Atsushi Kumagai
  2012-05-11 13:26                                     ` Don Zickus
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-05-11  1:19 UTC (permalink / raw)
  To: kexec; +Cc: dzickus, oomichi, d.hatayama, vgoyal

Hello, 

On Fri, 27 Apr 2012 08:52:14 -0400
Don Zickus <dzickus@redhat.com> wrote:

[..]
> > I tested the prototype based on _count and the other based on _mapcount.
> > So, the former didn't work as expected while the latter worked fine.
> > (The former excluded some used pages as free pages.)
> > 
> > As a next step, I measured performance of the prototype based on _mapcount,
> > please see below.
> 
> Thanks for this work.  I assume this work just switches the free page
> referencing and does not attempt to try and cut down on the memory usage
> (I guess that would be the next step if using mapcount is acceptable)?

Thank you for your reply, Don, Vivek.

As Don said, I tried to change the method for excluding free pages and
planned to resolve the memory consumption issue after that, because
parsing the free list repeatedly may cause a performance issue.

However, I'm now thinking that making the memory consumption a fixed size
is more important than resolving a performance issue on large systems.

So I'm afraid I would like to change the plan as follows:

  1. Implement "iterating filtering processing" so that the memory
     consumption becomes a fixed size (a rough sketch of the idea follows
     below). At this stage, makedumpfile will parse the free list
     repeatedly even though it may cause a performance issue.

  2. Take care of the performance issue after the 1st step.
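
A minimal sketch of the idea behind step 1 (hypothetical helper names, not
the final design): keep the bitmap for only a fixed window of pfns in memory
at a time, at the cost of re-walking the free lists for every window.

/* Hypothetical helpers; memory use stays constant because only the
 * bitmap for the current pfn window is held in memory. */
extern void init_window_bitmap(unsigned long start, unsigned long end);
extern void walk_free_list_and_mark(unsigned long start, unsigned long end);
extern void mark_other_excludable_pages(unsigned long start, unsigned long end);
extern void write_filtered_pages(unsigned long start, unsigned long end);

#define PFN_WINDOW      (1UL << 20)     /* ~4 GiB of 4 KiB pages per pass */

static void filter_in_windows(unsigned long max_pfn)
{
        unsigned long start, end;

        for (start = 0; start < max_pfn; start += PFN_WINDOW) {
                end = start + PFN_WINDOW;
                if (end > max_pfn)
                        end = max_pfn;

                init_window_bitmap(start, end);      /* small, fixed-size bitmap */
                /* the free_list is not sorted by pfn, so it must be
                 * re-walked in full for every window */
                walk_free_list_and_mark(start, end);
                mark_other_excludable_pages(start, end);
                write_filtered_pages(start, end);    /* copy surviving pages out */
                /* the window bitmap can now be discarded or reused */
        }
}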


Thanks
Atsushi Kumagai

> 
> > 
> > 
> > Performance Comparison:
> > 
> >   Explanation:
> >     - The new method supports 2.6.39 and later, and it needs vmlinux.
> > 
> >     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
> >       is different depending on kernel configuration and it isn't stored into 
> >       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
> >       when the value of PG_buddy is stored into VMCOREINFO.
> > 
> >     - The prototype has dump_level "32" to use new method, but I don't think
> >       to extend dump_level for official version.
> > 
> >   How to measure:
> >     I measured execution times with vmcore of 5GB in below cases with 
> >     attached patches.
> > 
> >       - dump_level 16: exclude only free pages with the current method
> >       - dump_level 31: exclude all excludable pages with the current method
> >       - dump_level 32: exclude only free pages with the new method
> >       - dump_level 47: exclude all excludable pages with the new method
> > 
> >   Result:
> >      ------------------------------------------------------------------------
> >      dump_level	     size [Bytes]    total time	   d_all_time     d_new_time	
> >      ------------------------------------------------------------------------
> >      	16		431864384	28.6s	     4.19s	      0s
> >      	31		111808568	14.5s	      0.9s	      0s
> >      	32		431864384	41.2s	     16.8s	   0.05s
> >      	47		111808568	31.5s	     16.6s	   0.05s
> >      ------------------------------------------------------------------------
> > 
> >   Discussion:
> >     I think the new method can be used instead of the current method in many cases.
> >     (However, the result of dump_level 31 looks too fast, I'm researching why
> >     the case can execute so fast.)
> > 
> >     I would like to get your opinion.
> 
> I am curious.  Looking through your patches, it seems d_all_time's
> increase in time should be from the new method because the if-statement is
> setup to only accept the new method.  Therefore I was expecting d_new_time
> for the new method when added to d_all_time for the current method would
> come close to d_all_time for the new method.  IOW I would have expected
> the extra 10-12 seconds from the new method to be found in d_new_time.
> 
> However, I do not see that.  d_new_time hardly increases at all.  So what
> is accounting for the increase in d_all_time for the new method?
> 
> Thanks,
> Don
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-05-11  1:19                                   ` Atsushi Kumagai
@ 2012-05-11 13:26                                     ` Don Zickus
  2012-05-15  5:57                                       ` Atsushi Kumagai
  0 siblings, 1 reply; 34+ messages in thread
From: Don Zickus @ 2012-05-11 13:26 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal

On Fri, May 11, 2012 at 10:19:52AM +0900, Atsushi Kumagai wrote:
> Hello, 
> 
> On Fri, 27 Apr 2012 08:52:14 -0400
> Don Zickus <dzickus@redhat.com> wrote:
> 
> [..]
> > > I tested the prototype based on _count and the other based on _mapcount.
> > > So, the former didn't work as expected while the latter worked fine.
> > > (The former excluded some used pages as free pages.)
> > > 
> > > As a next step, I measured performance of the prototype based on _mapcount,
> > > please see below.
> > 
> > Thanks for this work.  I assume this work just switches the free page
> > referencing and does not attempt to try and cut down on the memory usage
> > (I guess that would be the next step if using mapcount is acceptable)?
> 
> Thank you for your reply, Don, Vivek.
> 
> As Don said, I tried to change the method to exclude free pages and
> planed to resolve the memory consumption issue after it, because
> parsing free list repeatedly may cause a performance issue.
> 
> However, I'm thinking that to fix the size of memory consumption is more
> important than to resolve a performance issue for large system.
> 
> So I'm afraid that I would like to change the plan as:
> 
>   1. Implement "iterating filtering processing" to fix the size of memory
>      consumption. At this stage, makedumpfile will parse free list repeatedly
>      even though it may cause a performance issue.
>      
>   2. Take care of the performance issue after the 1st step.

Hello Atsushi-san,

Hmm.  The problem with the free list is that the addresses are in random
order, hence the reason to parse it repeatedly, correct?

I figured, now that you have a solution to parse the addresses in a linear
way (the changes you made a couple of weeks ago), you would just continue
with that.  With that complete, we can look at the performance issues and
solve them then.

But it is up to you.  You are willing to do the work, so I will defer to
your judgement on how best to proceed. :-)

Cheers,
Don

> 
> 
> Thanks
> Atsushi Kumagai
> 
> > 
> > > 
> > > 
> > > Performance Comparison:
> > > 
> > >   Explanation:
> > >     - The new method supports 2.6.39 and later, and it needs vmlinux.
> > > 
> > >     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
> > >       is different depending on kernel configuration and it isn't stored into 
> > >       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
> > >       when the value of PG_buddy is stored into VMCOREINFO.
> > > 
> > >     - The prototype has dump_level "32" to use new method, but I don't think
> > >       to extend dump_level for official version.
> > > 
> > >   How to measure:
> > >     I measured execution times with vmcore of 5GB in below cases with 
> > >     attached patches.
> > > 
> > >       - dump_level 16: exclude only free pages with the current method
> > >       - dump_level 31: exclude all excludable pages with the current method
> > >       - dump_level 32: exclude only free pages with the new method
> > >       - dump_level 47: exclude all excludable pages with the new method
> > > 
> > >   Result:
> > >      ------------------------------------------------------------------------
> > >      dump_level	     size [Bytes]    total time	   d_all_time     d_new_time	
> > >      ------------------------------------------------------------------------
> > >      	16		431864384	28.6s	     4.19s	      0s
> > >      	31		111808568	14.5s	      0.9s	      0s
> > >      	32		431864384	41.2s	     16.8s	   0.05s
> > >      	47		111808568	31.5s	     16.6s	   0.05s
> > >      ------------------------------------------------------------------------
> > > 
> > >   Discussion:
> > >     I think the new method can be used instead of the current method in many cases.
> > >     (However, the result of dump_level 31 looks too fast, I'm researching why
> > >     the case can execute so fast.)
> > > 
> > >     I would like to get your opinion.
> > 
> > I am curious.  Looking through your patches, it seems d_all_time's
> > increase in time should be from the new method because the if-statement is
> > setup to only accept the new method.  Therefore I was expecting d_new_time
> > for the new method when added to d_all_time for the current method would
> > come close to d_all_time for the new method.  IOW I would have expected
> > the extra 10-12 seconds from the new method to be found in d_new_time.
> > 
> > However, I do not see that.  d_new_time hardly increases at all.  So what
> > is accounting for the increase in d_all_time for the new method?
> > 
> > Thanks,
> > Don
> > 
> > _______________________________________________
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
       [not found]                               ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
  2012-04-27 12:52                                 ` Don Zickus
  2012-04-27 13:33                                 ` Vivek Goyal
@ 2012-05-14  5:44                                 ` HATAYAMA Daisuke
  2012-05-16  8:02                                   ` Atsushi Kumagai
  2 siblings, 1 reply; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-05-14  5:44 UTC (permalink / raw)
  To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Fri, 27 Apr 2012 16:46:49 +0900

>     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
>       is different depending on kernel configuration and it isn't stored into 
>       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
>       when the value of PG_buddy is stored into VMCOREINFO.

Hello Kumagai san,

I'm now investigating how to filter free pages without kernel debuginfo.
For this, I've investigated which of PG_buddy and _mapcount to use for
each kernel version. My current conclusion is that it's reasonable to do
it as shown in the following table.

| kernel version   |  Use PG_buddy? or _mapcount?                             |
|------------------+----------------------------------------------------------|
| 2.6.15 -- 2.6.16 | offsetof(page,_mapcount):=sizeof(ulong)+sizeof(atomic_t) |
| 2.6.17 -- 2.6.26 |        PG_buddy := 19                                    |
| 2.6.27 -- 2.6.36 |        PG_buddy := 18                                    |
| 2.6.37 and later | offsetof(page,_mapcount):= under investigation           |

In summary: PG_buddy was first introduced in 2.6.17, with value 19, to fix a
race bug leading to LRU list corruption, and from 2.6.17 to 2.6.26 it was
defined as a preprocessor macro. In 2.6.27, enum pageflags was introduced for
ease of page flag maintenance, and its value changed to 18. In 2.6.37 it was
removed, and it no longer exists in later kernel versions.

My quick feeling is that resolving the dependency on PG_buddy is simpler than
resolving that on _mapcount for 2.6.17 to 2.6.36.

In 2.6.15 and 2.6.16, PG_buddy had not yet been introduced, so we need to
rely on _mapcount. Resolving the _mapcount dependency in general, across all
supported kernel versions, is very complex, but in those two kernel versions
the definition of struct page begins with the layout below. I think it's not
too complex to hardcode the offset of _mapcount for these two kernel versions
only: that is, sizeof(unsigned long) + sizeof(atomic_t), where atomic_t is in
fact struct { volatile int counter; } on all platforms.

struct page {
        unsigned long flags;            /* Atomic flags, some possibly
                                         * updated asynchronously */
        atomic_t _count;                /* Usage count, see below. */
        atomic_t _mapcount;             /* Count of ptes mapped in mms,
...
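
A minimal sketch of how that per-version table might be hardcoded (the
version bounds come from the table above; the types, helper, and KVER()
macro are illustrative, not makedumpfile's real interfaces):

#define KVER(a, b, c)   (((a) << 16) + ((b) << 8) + (c))

enum free_page_method {
        USE_MAPCOUNT_HARDCODED,         /* 2.6.15 - 2.6.16 */
        USE_PG_BUDDY,                   /* 2.6.17 - 2.6.36 */
        USE_MAPCOUNT_UNKNOWN_OFFSET     /* 2.6.37+, still under investigation */
};

struct free_page_info {
        enum free_page_method method;
        long offset_mapcount;           /* -1 if unused */
        int  pg_buddy_bit;              /* -1 if unused */
};

static struct free_page_info choose_method(unsigned int kver)
{
        struct free_page_info info = { USE_MAPCOUNT_UNKNOWN_OFFSET, -1, -1 };

        if (kver < KVER(2, 6, 17)) {
                /* flags + _count precede _mapcount on these kernels;
                 * sizeof(int) stands in for sizeof(atomic_t) here */
                info.method = USE_MAPCOUNT_HARDCODED;
                info.offset_mapcount = sizeof(unsigned long) + sizeof(int);
        } else if (kver < KVER(2, 6, 27)) {
                info.method = USE_PG_BUDDY;
                info.pg_buddy_bit = 19;
        } else if (kver < KVER(2, 6, 37)) {
                info.method = USE_PG_BUDDY;
                info.pg_buddy_bit = 18;
        }
        return info;
}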

In the period when PG_buddy was defined as an enumeration value, its value
depended on CONFIG_PAGEFLAGS_EXTENDED. In commit
e20b8cca760ed2a6abcfe37ef56f2306790db648, PG_head and PG_tail were
introduced; they are positioned before PG_buddy if
CONFIG_PAGEFLAGS_EXTENDED is set, and the PG_buddy value then becomes
19. However, the only users of that option are mips, um and xtensa:

  $ git grep "CONFIG_PAGEFLAGS_EXTENDED"
  arch/mips/configs/db1300_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
  arch/um/defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
  arch/xtensa/configs/iss_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
  arch/xtensa/configs/s6105_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
  include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
  include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
  mm/memory-failure.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED
  mm/page_alloc.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED

and makedumpfile doesn't support any of these platforms now, so we
don't need to consider this case further.

On 2.6.37 and later kernels, we must use _mapcount. I'm now looking into
how to get the offset of _mapcount in each kernel version without kernel
debug information, but the page structure has changed considerably in
recent kernels, so I guess hardcoding the offsets there gets more
complicated.

Anyway, I think it is better to add the _mapcount information to VMCOREINFO
upstream as soon as possible.

Thanks.
HATAYAMA, Daisuke


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-05-11 13:26                                     ` Don Zickus
@ 2012-05-15  5:57                                       ` Atsushi Kumagai
  2012-05-15 12:35                                         ` Don Zickus
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-05-15  5:57 UTC (permalink / raw)
  To: dzickus; +Cc: oomichi, d.hatayama, kexec, vgoyal

Hello Don,

On Fri, 11 May 2012 09:26:01 -0400
Don Zickus <dzickus@redhat.com> wrote:

> > Thank you for your reply, Don, Vivek.
> > 
> > As Don said, I tried to change the method to exclude free pages and
> > planed to resolve the memory consumption issue after it, because
> > parsing free list repeatedly may cause a performance issue.
> > 
> > However, I'm thinking that to fix the size of memory consumption is more
> > important than to resolve a performance issue for large system.
> > 
> > So I'm afraid that I would like to change the plan as:
> > 
> >   1. Implement "iterating filtering processing" to fix the size of memory
> >      consumption. At this stage, makedumpfile will parse free list repeatedly
> >      even though it may cause a performance issue.
> >      
> >   2. Take care of the performance issue after the 1st step.
> 
> Hello Atsushi-san,
> 
> Hmm.  The problem with the free list is that the addresses are in random
> order, hence the reason to parse it repeatedly, correct?

Yes.

> I figured, now that you have a solution to parse the addresses in a linear
> way (the changes you made a couple of weeks ago), you would just continue
> with that.  With that complete, we can look at the performance issues and
> solve them then.
> 
> But it is up to you.  You are willing to do the work, so I will defer to
> your judgement on how best to proceed. :-)

What I wanted to tell you is that I want to resolve the memory consumption
issue as soon as possible. In other words, I think the particular method used
to exclude free pages is not so important.
I'll continue to work with whichever method is easier to implement.


Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-05-15  5:57                                       ` Atsushi Kumagai
@ 2012-05-15 12:35                                         ` Don Zickus
  0 siblings, 0 replies; 34+ messages in thread
From: Don Zickus @ 2012-05-15 12:35 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: oomichi, d.hatayama, kexec, vgoyal

On Tue, May 15, 2012 at 02:57:05PM +0900, Atsushi Kumagai wrote:
> Hello Don,
> 
> On Fri, 11 May 2012 09:26:01 -0400
> Don Zickus <dzickus@redhat.com> wrote:
> 
> > > Thank you for your reply, Don, Vivek.
> > > 
> > > As Don said, I tried to change the method to exclude free pages and
> > > planed to resolve the memory consumption issue after it, because
> > > parsing free list repeatedly may cause a performance issue.
> > > 
> > > However, I'm thinking that to fix the size of memory consumption is more
> > > important than to resolve a performance issue for large system.
> > > 
> > > So I'm afraid that I would like to change the plan as:
> > > 
> > >   1. Implement "iterating filtering processing" to fix the size of memory
> > >      consumption. At this stage, makedumpfile will parse free list repeatedly
> > >      even though it may cause a performance issue.
> > >      
> > >   2. Take care of the performance issue after the 1st step.
> > 
> > Hello Atsushi-san,
> > 
> > Hmm.  The problem with the free list is that the addresses are in random
> > order, hence the reason to parse it repeatedly, correct?
> 
> Yes.
> 
> > I figured, now that you have a solution to parse the addresses in a linear
> > way (the changes you made a couple of weeks ago), you would just continue
> > with that.  With that complete, we can look at the performance issues and
> > solve them then.
> > 
> > But it is up to you.  You are willing to do the work, so I will defer to
> > your judgement on how best to proceed. :-)
> 
> What I wanted to tell you was I want to resolve the memory consumption issue
> as soon as possible. In other words, I think the method to exclude free pages
> is not so important.
> I'll continue to work with the method which is easy to implement.

Ok.  I look forward to your results.  Thanks for your effort.

Cheers,
Don

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-05-14  5:44                                 ` HATAYAMA Daisuke
@ 2012-05-16  8:02                                   ` Atsushi Kumagai
  2012-05-17  0:21                                     ` HATAYAMA Daisuke
  0 siblings, 1 reply; 34+ messages in thread
From: Atsushi Kumagai @ 2012-05-16  8:02 UTC (permalink / raw)
  To: d.hatayama; +Cc: dzickus, oomichi, kexec, vgoyal

Hello HATAYAMA-san,

On Mon, 14 May 2012 14:44:28 +0900 (JST)
HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:

> From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
> Subject: Re: makedumpfile memory usage grows with system memory size
> Date: Fri, 27 Apr 2012 16:46:49 +0900
> 
> >     - Now, the prototype doesn't support PG_buddy because the value of PG_buddy
> >       is different depending on kernel configuration and it isn't stored into 
> >       VMCOREINFO. However, I'll extend get_length_of_free_pages() for PG_buddy 
> >       when the value of PG_buddy is stored into VMCOREINFO.
> 
> Hello Kumagai san,
> 
> I'm now investigating how to perform filtering free pages without
> kernel debuginfo. For this, I've investigated which of PG_buddy and
> _mapcount to use in kernel versions. In the current conclusion, it's
> reasonable to do that as shown in the following table.
> 
> | kernel version   |  Use PG_buddy? or _mapcount?                             |
> |------------------+----------------------------------------------------------|
> | 2.6.15 -- 2.6.16 | offsetof(page,_mapcount):=sizeof(ulong)+sizeof(atomic_t) |
> | 2.6.17 -- 2.6.26 |        PG_buddy := 19                                    |
> | 2.6.27 -- 2.6.36 |        PG_buddy := 18                                    |
> | 2.6.37 and later | offsetof(page,_mapcount):= under investigation           |

Thank you for your investigation; it's very helpful!

> In summary: PG_buddy was first introduced at 2.6.17 as 19 to fix some
> race bug leading to lru list corruptions, and from 2.6.17 to 2.6.26,
> it had been defined using macro preprocessor. At 2.6.27 enum pageflags
> was introduced for ease of page flags maintainance and its value
> changed to 18. At 2.6.37, it was removed, and it no longer exists in
> later kernel versions.
> 
> My quick feeling is that solving dependency of PG_buddy is simler than
> that of _mapcount from 2.6.17 to 2.6.36.
> 
> From 2.6.15 to 2.6.16, PG_buddy has not been introduced so we need to
> rely on _mapcount. It's very complex to solve _mapcount dependency in
> general on all supported kernel versions, but only on both kernel
> versions, definition of struct page begins with the following
> layout. I think it's not so much complex to hardcode offset of
> _mapcount for these two kernel versions only: that is, sizeof(unsigned
> long) + sizeof(atomic_t) which is in fact struct { volatile int
> counter } on all platforms.
> 
> struct page {
>         unsigned long flags;            /* Atomic flags, some possibly
>                                          * updated asynchronously */
>         atomic_t _count;                /* Usage count, see below. */
>         atomic_t _mapcount;             /* Count of ptes mapped in mms,
> ...
> 
> In the period of PG_buddy is defined as enumeration value, PG_buddy
> value depends on CONFIG_PAGEFLAGS_EXTENDED. At commit
> e20b8cca760ed2a6abcfe37ef56f2306790db648, PG_head and PG_tail were
> introduced and they are positioned before PG_buddy if
> CONFIG_PAGEFLAGS_EXTENDED is set; then PG_buddy value becomes
> 19. However, its users are mips, um and xtensa only as:
> 
>   $ git grep "CONFIG_PAGEFLAGS_EXTENDED"
>   arch/mips/configs/db1300_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
>   arch/um/defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
>   arch/xtensa/configs/iss_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
>   arch/xtensa/configs/s6105_defconfig:CONFIG_PAGEFLAGS_EXTENDED=y
>   include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
>   include/linux/page-flags.h:#ifdef CONFIG_PAGEFLAGS_EXTENDED
>   mm/memory-failure.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED
>   mm/page_alloc.c:#ifdef CONFIG_PAGEFLAGS_EXTENDED
> 
> and makedumpfile doesn't support any of these platforms now. So we
> don't need to consider this case more.
> 
> On 2.6.37 and the later kernels, we must use _mapcount. I'm now
> looking into how to get offset of _mapcount in each kernel version
> without kernel debug information. But page structure has changed
> considerably on recent kernels so I guess the way hardcoding them gets
> more complicated.
> 
> Anyway, I think it better to add _mapcount information to VMCOREINFO
> on upstream as soon as possible.

I think using _mapcount is the better way.
But we haven't definitely decided to use _mapcount, and even if we decide to
use it, we still have problems with using it.
For example, the upstream kernel (v3.4-rc7) has _mapcount in a union, so we
need information to judge whether the data found there is _mapcount or not.
So more investigation is needed, and I think it's too early to send the
request to the upstream kernel.

I plan to finish the work to reduce memory consumption by the end of June,
and I will continue to discuss the performance issues after that.
Therefore, the request will be delayed until July or August.


Thanks
Atsushi Kumagai

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
  2012-05-16  8:02                                   ` Atsushi Kumagai
@ 2012-05-17  0:21                                     ` HATAYAMA Daisuke
  0 siblings, 0 replies; 34+ messages in thread
From: HATAYAMA Daisuke @ 2012-05-17  0:21 UTC (permalink / raw)
  To: kumagai-atsushi; +Cc: dzickus, oomichi, kexec, vgoyal

From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Wed, 16 May 2012 17:02:30 +0900

> On Mon, 14 May 2012 14:44:28 +0900 (JST)
> HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:

>> Anyway, I think it better to add _mapcount information to VMCOREINFO
>> on upstream as soon as possible.
> 
> I think it's better way to use _mapcount. 
> But we don't certainly decide to use _mapcount and even if we decide to use it,
> we still have problems to use it.
> For example, the upstream kernel(v3.4-rc7) has _mapcount in union, we need
> a information to judge whether the found data is _mapcount or not. 
> So, more investigation is needed and I think it's too early to send the request
> to upstream kernel.

A quick look at the other part of the union to which _mapcount
belongs---inuse, objects, frozen---suggests it is used by the SLUB
allocator. A page with PG_slab appears to use that other part rather than
_mapcount. This means that to decide whether we can use _mapcount, it's
necessary to first investigate how the SLUB allocator works.
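
A minimal sketch of that guard (hypothetical helper names and values; the
buddy marker value and the PG_slab bit would have to come from the kernel
side, e.g. via VMCOREINFO, and are not hardcoded here):

/* Hypothetical helpers and values, assumed to be provided elsewhere. */
extern unsigned long read_page_flags(unsigned long pfn);
extern long read_page_mapcount(unsigned long pfn);
extern int pg_slab_bit;
extern long buddy_mapcount_value;

/*
 * Hedged sketch only: before interpreting page->_mapcount as the buddy
 * marker, skip slab pages, since SLUB reuses that union member for
 * inuse/objects/frozen.
 */
static int page_is_free_buddy(unsigned long pfn)
{
        unsigned long flags = read_page_flags(pfn);

        if (flags & (1UL << pg_slab_bit))
                return 0;       /* union holds SLUB data, not _mapcount */

        return read_page_mapcount(pfn) == buddy_mapcount_value;
}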

> 
> I plan to finish working to reduce memory consumption by the end of June, 
> and I will continue to discuss performance issues.
> Therefore, the request will be delayed until July or August.
> 

I'll wait for your patch on memory consumption and will give feedback on
it. I'll also look into the possibility of filtering free memory in
constant space on recent kernels.

Thanks.
HATAYAMA, Daisuke


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: makedumpfile memory usage grows with system memory size
@ 2012-04-02  6:53 tachibana
  0 siblings, 0 replies; 34+ messages in thread
From: tachibana @ 2012-04-02  6:53 UTC (permalink / raw)
  To: Don Zickus; +Cc: kexec

Hi Don,

On 2012/03/30 09:19:16 -0400, Don Zickus <dzickus@redhat.com> wrote:
> On Fri, Mar 30, 2012 at 06:43:34PM +0900, Atsushi Kumagai wrote:
> > Hello Don,
> > Does setting TMPDIR solve your problem ? Please refer to the man page.
> > 
> > 
> >     ENVIRONMENT VARIABLES
> >            TMPDIR  This  environment  variable  is  for  a temporary memory bitmap
> >                    file.  If your machine has a lots of memory and you  use  tmpfs
> >                    on  /tmp,  makedumpfile can fail for a little memory in the 2nd
> >                    kernel because makedumpfile makes a very large temporary memory
> >                    bitmap  file in this case. To avoid this failure, you can set a
> >                    TMPDIR environment variable. If you do not set a  TMPDIR  envi-
> >                    ronment variable, makedumpfile uses /tmp directory for a tempo-
> >                    rary bitmap file as a default.
> 
> I do not think it will because we run the second kernel inside the
> initramfs and do not mount any extra disks.  So the only location available
> for the temporary memory bitmap would be memory either tmpfs or something
> else.  Regardless the file ends up in memory.

If the file system that will hold the dump file is on the local system,
wouldn't it be effective to specify a directory on that same file system
as TMPDIR?


Thanks
tachibana

> 
> > 
> > 
> > On the other hand, I'm considering the enhancement suggested by Hatayama-san now.
> 
> His idea looks interesting if it works.  Thanks.
> 
> Cheers,
> Don
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 34+ messages in thread


Thread overview: 34+ messages
2012-03-28 21:22 makedumpfile memory usage grows with system memory size Don Zickus
2012-03-29  8:09 ` Ken'ichi Ohmichi
2012-03-29 12:56   ` HATAYAMA Daisuke
2012-03-29 13:25     ` Don Zickus
2012-03-30  0:51       ` HATAYAMA Daisuke
2012-04-02  7:46         ` Atsushi Kumagai
2012-04-05  6:52           ` HATAYAMA Daisuke
2012-04-05 14:34             ` Vivek Goyal
2012-04-06  1:12               ` HATAYAMA Daisuke
2012-04-06  8:59                 ` Atsushi Kumagai
2012-04-06  9:29                   ` HATAYAMA Daisuke
2012-04-09 18:57                     ` Vivek Goyal
2012-04-09 23:58                       ` HATAYAMA Daisuke
2012-04-10 12:52                         ` Vivek Goyal
2012-04-12  3:40                           ` Atsushi Kumagai
2012-04-12  7:47                             ` HATAYAMA Daisuke
     [not found]                               ` <20120427164649.9932a33f.kumagai-atsushi@mxc.nes.nec.co.jp>
2012-04-27 12:52                                 ` Don Zickus
2012-05-11  1:19                                   ` Atsushi Kumagai
2012-05-11 13:26                                     ` Don Zickus
2012-05-15  5:57                                       ` Atsushi Kumagai
2012-05-15 12:35                                         ` Don Zickus
2012-04-27 13:33                                 ` Vivek Goyal
2012-05-14  5:44                                 ` HATAYAMA Daisuke
2012-05-16  8:02                                   ` Atsushi Kumagai
2012-05-17  0:21                                     ` HATAYAMA Daisuke
2012-04-09 19:00                 ` Vivek Goyal
2012-03-29 13:05   ` Don Zickus
2012-03-30  9:43     ` Atsushi Kumagai
2012-03-30 13:19       ` Don Zickus
2012-04-02 17:15   ` Michael Holzheu
2012-04-06  8:09     ` Atsushi Kumagai
2012-04-11  8:04       ` Michael Holzheu
2012-04-12  8:49         ` Atsushi Kumagai
2012-04-02  6:53 tachibana
