[Follow-up] Physical memory disappeared from /proc/meminfo

* [Follow-up] Physical memory disappeared from /proc/meminfo
@ 2008-08-17 17:59 Marc Villemade
  2008-08-17 18:59 ` Rik van Riel
  2008-08-18  6:09 ` Andi Kleen
  0 siblings, 2 replies; 6+ messages in thread
From: Marc Villemade @ 2008-08-17 17:59 UTC
  To: linux-kernel

Hi everyone,

(I apologize in advance for this long email)

While looking for answers to the memory problems I've been having for
some time now, I stumbled onto these posts:

From last year:
http://kerneltrap.org/mailarchive/linux-kernel/2007/8/26/164909

and from a couple of months ago:
http://kerneltrap.org/mailarchive/linux-kernel/2008/6/24/2209554

I'm having exactly the same issue, but on a vanilla 2.6.20.4 kernel
(x86). /proc/meminfo shows that
MemFree+Buffers+Cached+AnonPages+Slab+Mapped does not add up to
MemTotal, which AFAIK it should.

6_days_uptime_machine ~ # cat /proc/meminfo
MemTotal:      3106668 kB
MemFree:        678104 kB
Buffers:        120024 kB
Cached:          69892 kB
SwapCached:          0 kB
Active:         740872 kB
Inactive:      1621704 kB
HighTotal:     2227996 kB
HighFree:        21380 kB
LowTotal:       878672 kB
LowFree:        656724 kB
SwapTotal:     4192956 kB
SwapFree:      4192956 kB
Dirty:            1292 kB
Writeback:           0 kB
AnonPages:      586900 kB
Mapped:          13824 kB
Slab:            50432 kB
SReclaimable:    39092 kB
SUnreclaim:      11340 kB
PageTables:       1532 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   5746288 kB
Committed_AS:  1073624 kB
VmallocTotal:   114680 kB
VmallocUsed:      8944 kB
VmallocChunk:   105568 kB

In contrast, here's the meminfo from a machine that was rebooted 20
hours ago, in which the above-mentioned figures almost add up to
MemTotal. There's already more than 50 MB missing, which makes me
think the leak starts right after boot...

20_hours_uptime_machine ~ # cat /proc/meminfo
MemTotal:      3106668 kB
MemFree:       2455932 kB
Buffers:         88624 kB
Cached:          69364 kB
SwapCached:          0 kB
Active:         496772 kB
Inactive:       114680 kB
HighTotal:     2227996 kB
HighFree:      1695240 kB
LowTotal:       878672 kB
LowFree:        760692 kB
SwapTotal:     4192956 kB
SwapFree:      4192956 kB
Dirty:            1016 kB
Writeback:           0 kB
AnonPages:      395888 kB
Mapped:          13956 kB
Slab:            23400 kB
SReclaimable:    12828 kB
SUnreclaim:      10572 kB
PageTables:       1048 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   5746288 kB
Committed_AS:   988928 kB
VmallocTotal:   114680 kB
VmallocUsed:      8944 kB
VmallocChunk:   105568 kB
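
As a rough check of the arithmetic, here is a minimal Python sketch that applies the same sum (MemFree+Buffers+Cached+AnonPages+Slab+Mapped) to both dumps. The helper names `parse_meminfo` and `unaccounted_kb` are made up for illustration, and only the relevant lines of each dump are reproduced. Note the assumption that these fields are disjoint: Mapped probably overlaps with Cached and AnonPages, in which case the real gap would be slightly larger than what this prints.

```python
# Fields summed in the mail; treating them as disjoint is an assumption.
ACCOUNTED = ("MemFree", "Buffers", "Cached", "AnonPages", "Slab", "Mapped")

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields

def unaccounted_kb(fields):
    """MemTotal minus the sum of the accounted fields, in kB."""
    return fields["MemTotal"] - sum(fields[k] for k in ACCOUNTED)

# Relevant lines copied from the two dumps above.
SIX_DAYS = """\
MemTotal:      3106668 kB
MemFree:        678104 kB
Buffers:        120024 kB
Cached:          69892 kB
AnonPages:      586900 kB
Mapped:          13824 kB
Slab:            50432 kB
"""

TWENTY_HOURS = """\
MemTotal:      3106668 kB
MemFree:       2455932 kB
Buffers:         88624 kB
Cached:          69364 kB
AnonPages:      395888 kB
Mapped:          13956 kB
Slab:            23400 kB
"""

for label, dump in (("6 days", SIX_DAYS), ("20 hours", TWENTY_HOURS)):
    gap = unaccounted_kb(parse_meminfo(dump))
    print(f"{label}: {gap} kB unaccounted ({gap / 1024:.0f} MB)")
```

With these numbers the gap is 1587492 kB (about 1.5 GB) after 6 days and 59504 kB (about 58 MB) after 20 hours. On a live machine the same function could be fed the contents of /proc/meminfo directly.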

Over time, I've noticed that the LRU lists (active/inactive) just get
bigger and bigger, and the inactive list especially never seems to get
freed, which doesn't make a lot of sense to me. I've tried the
drop_caches trick (/proc/sys/vm/drop_caches), which helps for a while
(but still doesn't bring the memory accounting back to normal), but it
is not a fix, only a temporary workaround. I'd like to have those
machines running without us having to drop caches once in a while.

The main brain-teaser for me here is that the machines were in use
several months ago in another, almost identical setup - same kernel,
same running processes; the only differences are mostly network-wise,
and the machines have not been reinstalled - and we didn't have this
kind of issue. Now we have to reboot the servers every other week,
otherwise applications eventually get refused when they ask for more
memory. That is inexplicable to me!
Which is why I turn to you guys ;)

Looking at meminfo, something else strikes me: if SwapCached means
that something was swapped out at some point, and it is always 0 on
my machines, how can a machine that is apparently running out of
memory, with swap enabled, never swap anything? It seems logical to me
that one can't have more memory than the system can allocate, which
would make swap space on a 32-bit machine with 4 GB of RAM useless, if
it were not for the MMU. Those machines have the MMU enabled (hence
the 3 GB available even though 4 GB are physically installed), so I
should be able to use swap. Why, then, doesn't that seem to happen
when the machines are apparently running out of memory (refusing
malloc() calls)? Or maybe I'm just totally wrong about the meaning of
SwapCached?
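
For what it's worth, SwapCached only counts pages that were swapped out and later read back in while a copy still sits on the swap device, so a value of 0 does not by itself show that swap was never used. Swap currently in use is SwapTotal minus SwapFree. A minimal Python sketch of that check, using the values from the 6-day dump above (`swap_used_kb` is a hypothetical helper name):

```python
def swap_used_kb(swap_total_kb, swap_free_kb):
    """Swap currently in use, in kB: SwapTotal minus SwapFree."""
    return swap_total_kb - swap_free_kb

# SwapTotal and SwapFree as reported by /proc/meminfo on the 6-day machine.
used = swap_used_kb(4192956, 4192956)
print(f"swap in use: {used} kB")
```

Both dumps report SwapTotal equal to SwapFree, so by this measure no swap is in use on either machine.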

I've browsed (read: grep'd) through the changelogs from 2.6.20.4 up
to 2.6.26.3 and saw that a substantial number of memory leaks were
fixed during that period, but they were mostly linked to USB (I don't
have any USB devices on these machines, although usbfs is used),
NETFILTER (which I don't use) or architectures other than x86. I
didn't see anything strikingly matching my setup, except maybe for
some SCSI bugs (mostly linked to firmware).

Rob Mueller, in June (the second post referred to above), was running
2.6.25.x and still had the problem. Would you guys know if 2.6.26
fixes this issue? Fred, in the first thread I linked, says he doesn't
have the issue with 2.6.1 but had it with 2.6.12 and 2.6.20.x.

Here is some more info on the 6 days uptime machine. I didn't include
a full dmesg because this mail is already pretty long, and it doesn't
seem to me that there is anything of interest in it, but I could be
totally wrong, so please let me know if you want me to send it to you
as well. I'll just copy a couple of lines which look a bit suspicious
to me:

------------------------------------------------ DMESG (excerpt)

PM: Writing back config space on device 0000:08:03.0 at offset 3 (was 804000, writing 804010)
PM: Writing back config space on device 0000:08:03.0 at offset 2 (was 2000000, writing 2000010)
PM: Writing back config space on device 0000:08:03.0 at offset 1 (was 2b00000, writing 2b00146)

------------------------------------------------ ZONEINFO

6_days_uptime_machine ~ # cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     2827
        min      17
        low      21
        high     25
        active   0
        inactive 0
        scanned  0 (a: 9 i: 9)
        spanned  4096
        present  4064
    nr_anon_pages 0
    nr_mapped    1
    nr_file_pages 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 873, 4048)
  pagesets
  all_unreclaimable: 1
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     161364
        min      936
        low      1170
        high     1404
        active   33965
        inactive 7245
        scanned  0 (a: 0 i: 28)
        spanned  225280
        present  223520
    nr_anon_pages 5081
    nr_mapped    0
    nr_file_pages 31868
    nr_slab_reclaimable 9773
    nr_slab_unreclaimable 2790
    nr_page_table_pages 383
    nr_dirty     18
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 3
        protection: (0, 0, 25400)
  pagesets
    cpu: 0 pcp: 0
              count: 140
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 23
              high:  62
              batch: 15
  vm stats threshold: 24
    cpu: 1 pcp: 0
              count: 19
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 14
              high:  62
              batch: 15
  vm stats threshold: 24
    cpu: 2 pcp: 0
              count: 158
              high:  186
              batch: 31
    cpu: 2 pcp: 1
              count: 10
              high:  62
              batch: 15
  vm stats threshold: 24
    cpu: 3 pcp: 0
              count: 94
              high:  186
              batch: 31
    cpu: 3 pcp: 1
              count: 7
              high:  62
              batch: 15
  vm stats threshold: 24
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096
Node 0, zone  HighMem
  pages free     3175
        min      128
        low      979
        high     1831
        active   153497
        inactive 398182
        scanned  0 (a: 0 i: 0)
        spanned  819200
        present  812800
    nr_anon_pages 143849
    nr_mapped    3456
    nr_file_pages 15590
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_dirty     44
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
        protection: (0, 0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 14
              high:  186
              batch: 31
    cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 36
    cpu: 1 pcp: 0
              count: 12
              high:  186
              batch: 31
    cpu: 1 pcp: 1
              count: 3
              high:  62
              batch: 15
  vm stats threshold: 36
    cpu: 2 pcp: 0
              count: 7
              high:  186
              batch: 31
    cpu: 2 pcp: 1
              count: 3
              high:  62
              batch: 15
  vm stats threshold: 36
    cpu: 3 pcp: 0
              count: 33
              high:  186
              batch: 31
    cpu: 3 pcp: 1
              count: 9
              high:  62
              batch: 15
  vm stats threshold: 36
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         229376
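
One way to look at the zoneinfo numbers above is to compare, per zone, the size of the LRU lists (active + inactive) with the pages the per-zone counters account for (nr_anon_pages + nr_file_pages). The Python sketch below does that with the values copied by hand from the dump. It rests on two assumptions that are not stated in the dump itself: a 4 kB page size (the usual x86 value), and the idea that LRU pages should be roughly covered by the anon and file counters. `lru_gap_pages` is a hypothetical helper name.

```python
PAGE_KB = 4  # assumption: 4 kB pages, the usual size on 32-bit x86

# Per-zone values copied from the /proc/zoneinfo dump above (in pages).
ZONES = {
    "DMA":     {"active": 0,      "inactive": 0,
                "nr_anon_pages": 0,      "nr_file_pages": 0},
    "Normal":  {"active": 33965,  "inactive": 7245,
                "nr_anon_pages": 5081,   "nr_file_pages": 31868},
    "HighMem": {"active": 153497, "inactive": 398182,
                "nr_anon_pages": 143849, "nr_file_pages": 15590},
}

def lru_gap_pages(zone):
    """Pages on the LRU lists that the anon/file counters do not cover."""
    lru = zone["active"] + zone["inactive"]
    accounted = zone["nr_anon_pages"] + zone["nr_file_pages"]
    return lru - accounted

total_gap = 0
for name, zone in ZONES.items():
    gap = lru_gap_pages(zone)
    total_gap += gap
    print(f"{name:8s} LRU gap: {gap:7d} pages ({gap * PAGE_KB} kB)")
print(f"total    LRU gap: {total_gap:7d} pages ({total_gap * PAGE_KB} kB)")
```

Under those assumptions almost all of the gap sits in HighMem (392240 pages), and the total of 396501 pages, or 1586004 kB, is close to the 1587492 kB by which the 6-day meminfo fields fall short of MemTotal. That suggests, without proving it, that the "missing" memory consists of pages sitting on the LRU lists that are counted as neither anonymous nor file pages.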

------------------------------------------------ LSMOD

6_days_uptime_machine ~ # lsmod
Module                  Size  Used by
iptable_nat             7172  0
nf_nat                 16172  1 iptable_nat
nf_conntrack_ipv4      14860  2 iptable_nat
nf_conntrack           51336  3 iptable_nat,nf_nat,nf_conntrack_ipv4
nfnetlink               6040  3 nf_nat,nf_conntrack_ipv4,nf_conntrack
iptable_filter          3332  1
ip_tables              11508  2 iptable_nat,iptable_filter
x_tables               12804  2 iptable_nat,ip_tables
rtc                    11184  0
bonding                84248  0
bnx2                  142960  0
zlib_inflate           15232  1 bnx2
evdev                   9088  0
raid456               119568  0
xor                    15112  1 raid456
tg3                   104712  0
e1000                 121856  0
sata_nv                15496  0
libata                 96164  1 sata_nv
usbhid                 15240  0
ohci_hcd               19852  0
uhci_hcd               22036  0
usb_storage            34312  0
ehci_hcd               28824  0
usbcore               115084  6 usbhid,ohci_hcd,uhci_hcd,usb_storage,ehci_hcd

Thanks for any information you might have that would help me figure
this out. We've been having this problem for two months now, and it's
getting very frustrating not to be able to fix it or even understand
where the problem stems from. If you need any more information, I'd be
happy to provide it. Just ask!

Cheers

Marc Villemade
