On Fri, Sep 27, 2019 at 08:39:04PM +0000, Kazuhito Hagio wrote:
 > > -----Original Message-----
 > > On Thu, Sep 26, 2019 at 06:41:48PM +0000, Kazuhito Hagio wrote:
 > > 
 > >  > > -----Original Message-----
 > >  > > If info->max_mapnr and pfn_memhole are equal, we divide by zero when
 > >  > > trying to determine the 'shrinking' value.
 > >  > >
 > >  > > On the system I saw this error, we arrived at this function with
 > >  > > info->max_mapnr:0x0000000001080000 pfn_memhole:0x0000000001080000
 > >  >
 > >  > Thank you for the patch.
 > >  > I suppose that you see the error with the -E option, right?
 > >  >
 > >  > It seems that the -E option has some problems with its statistics,
 > >  > so I'm checking whether there is a better way to fix this.
 > > 
 > > Yes, we use the -E option.
 > > We manage to get useful info from the generated dump after this fix, so
 > > it seems it only affects the statistics output.
 > 
 > OK, the statistics in cyclic mode with the -E option are completely wrong
 > but a possible fix is likely to affect the whole of cyclic processing, so
 > I just cover the hole with your patch and leave the statistics problem as
 > a known issue at this time.  I would revisit it when I have time.
 > 
 > The patch was applied to the devel branch.

While this patch does avoid the divide by zero, further analysis
suggests there is a deeper problem when we hit this
'Original pages = 0' situation.

Take a look at the attached output from makedumpfile.

Key part in the summary:

[  518.819690] Original pages  : 0x0000000000000000
[  518.828894]   Excluded pages   : 0x0000000003decd15
[  518.838635]     Pages filled with zero  : 0x00000000000210ee
[  518.849920]     Non-private cache pages : 0x000000000000271a
[  518.861218]     Private cache pages     : 0x000000000000da47
[  518.872502]     User process data pages : 0x0000000003d6bdc8
[  518.883786]     Free pages              : 0x000000000004fcfe
[  518.895070]     Hwpoison pages          : 0x0000000000000000
[  518.906356]     Offline pages           : 0x0000000000000000
[  518.917659]   Remaining pages  : 0xfffffffffc2132eb
[  518.927398] Memory Hole     : 0x0000000004080000

In this case, 'Remaining pages' has gone negative (0xfffffffffc2132eb is
-0x3decd15 as a 64-bit value), which looks concerning.
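
For what it's worth, the summary numbers are consistent with a plain
unsigned wraparound. The sketch below is not makedumpfile code; it only
replays the arithmetic from the summary, assuming (my assumption) that
'Remaining pages' is 'Original pages' minus 'Excluded pages' held in a
64-bit unsigned counter. The dictionary keys are just labels I made up.

```python
# Sketch only: replays the summary arithmetic, not makedumpfile's code.
MASK64 = (1 << 64) - 1  # emulate a 64-bit unsigned counter

original = 0x0  # "Original pages" from the summary

# The per-category counts listed under "Excluded pages" in the summary.
excluded_parts = {
    "zero": 0x210EE,
    "non_private_cache": 0x271A,
    "private_cache": 0xDA47,
    "user": 0x3D6BDC8,
    "free": 0x4FCFE,
    "hwpoison": 0x0,
    "offline": 0x0,
}

excluded = sum(excluded_parts.values())
# Assumed relationship: remaining = original - excluded, truncated to 64 bits.
remaining = (original - excluded) & MASK64

print(hex(excluded))   # 0x3decd15
print(hex(remaining))  # 0xfffffffffc2132eb
```

So the per-category counts add up to the reported 'Excluded pages', and
the odd 'Remaining pages' value follows directly from 'Original pages'
being 0.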

And the crashdump seems corrupt:

'crash' complains:
WARNING: possibly corrupt Elf64_Nhdr: n_namesz: 2079035392 n_descsz: 3 n_type: 1000

vmcore-dmesg complains "Missing the log_buf symbol", even though the
makedumpfile log shows it was present at ffffffff822510a0.

readelf also reports that the notes are mangled.

# readelf -n vmcore 

Displaying notes found at file offset 0x00015468 with length 0x0000556c:
  Owner                 Data size       Description
                       0x00000007       Unknown note type: (0x727c79d4)
readelf: vmcore: Warning: Corrupt note: name size is too big: 7beb9000
  (NONE)               0x00000003       Unknown note type: (0x00001000)
readelf: vmcore: Warning: Corrupt note: name size is too big: 55a000
  (NONE)               0x00000000       Unknown note type: (0x00000000)
  (NONE)               0x00000001       Unknown note type: (0x00000007)
readelf: vmcore: Warning: note with invalid namesz and/or descsz found at offset 0x44
readelf: vmcore: Warning:  type: 0xffff8803, namesize: 0x00000000, descsize: 0x7c413000
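
Note that the n_namesz crash reports (2079035392) is 0x7beb9000, so crash
and readelf appear to be tripping over the same bad note header. A check
along the following lines could be used to find the first header that
stops making sense. This is a rough sketch, not anything taken from
makedumpfile: walk_notes() and align4() are hypothetical names, and it
assumes a little-endian dump and that the note segment bytes have already
been read into a buffer. The only thing it relies on is the standard ELF
note layout (three 32-bit words n_namesz, n_descsz, n_type, followed by
the name and descriptor, each padded to 4 bytes).

```python
import struct

NHDR_SIZE = 12  # n_namesz, n_descsz, n_type: three 32-bit words


def align4(n):
    """Round n up to the next multiple of 4 (ELF note padding)."""
    return (n + 3) & ~3


def walk_notes(buf):
    """Walk note headers in buf; stop at the first one that overruns it.

    Returns a list of (offset, namesz, descsz, type, fits) tuples.
    Assumes little-endian headers (hypothetical helper, sketch only).
    """
    entries = []
    off = 0
    while off + NHDR_SIZE <= len(buf):
        namesz, descsz, ntype = struct.unpack_from("<III", buf, off)
        end = off + NHDR_SIZE + align4(namesz) + align4(descsz)
        fits = end <= len(buf)
        entries.append((off, namesz, descsz, ntype, fits))
        if not fits:
            break  # anything after a bad header cannot be trusted
        off = end
    return entries


# Synthetic example: one well-formed note followed by a header carrying
# the values crash complained about.
good = struct.pack("<III", 5, 8, 1) + b"CORE\0".ljust(8, b"\0") + bytes(8)
bad = struct.pack("<III", 0x7BEB9000, 3, 0x1000)
buf = good + bad

for off, namesz, descsz, ntype, fits in walk_notes(buf):
    print(hex(off), hex(namesz), hex(descsz), hex(ntype), fits)
```

Running something like this over the notes as they are read from
/proc/vmcore and again as they are written out might show whether the
input is already bad or whether the damage happens on the way out.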


Any thoughts on where to add additional debugging in makedumpfile?

	Dave