From: "Petr Tesařík" <petr@tesarici.cz>
To: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Omar Sandoval <osandov@osandov.com>, linux-debuggers@vger.kernel.org
Subject: Re: Segmentation fault with drgn + libkdumpfile
Date: Wed, 10 Jan 2024 09:36:00 +0100 [thread overview]
Message-ID: <20240110093600.61005acd@meshulam.tesarici.cz> (raw)
In-Reply-To: <87sf36yni8.fsf@oracle.com>
On Tue, 09 Jan 2024 17:40:15 -0800
Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> Petr Tesařík <petr@tesarici.cz> writes:
>
> > On Mon, 8 Jan 2024 21:40:08 +0100
> > Petr Tesařík <petr@tesarici.cz> wrote:
> >
> >> On Fri, 05 Jan 2024 13:53:15 -0800
> >> Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> >>
> >> > Petr Tesařík <petr@tesarici.cz> writes:
> >> > > On Fri, 05 Jan 2024 10:38:16 -0800
> >> > > Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> >> > >
> >> > >> Hi Petr,
> >> > >>
> >> > >> I recently encountered a segmentation fault with libkdumpfile & drgn
> >> > >> which appears to be related to the cache implementation. I've included
> >> > >> the stack trace at the end of this message, since it's a bit of a longer
> >> > >> one. The exact issue occurred with a test vmcore that I could probably
> >> > >> share with you privately if you'd like. In any case, the reproducer is
> >> > >> fairly straightforward in drgn code:
> >> > >>
> >> > >> for t in for_each_task(prog):
> >> > >>     prog.stack_trace(t)
> >> > >> for t in for_each_task(prog):
> >> > >>     prog.stack_trace(t)
> >> > >>
> >> > >> The repetition is required, the segfault only occurs on the second
> >> > >> iteration of the loop. Which, in hindsight, is a textbook sign that the
> >> > >> issue has to do with caching. I'd expect that the issue is specific to
> >> > >> this vmcore, it doesn't reproduce on others.
> >> > >>
> >> > >> I stuck that into a git bisect script and bisected the libkdumpfile
> >> > >> commit that introduced it:
> >> > >>
> >> > >> commit 487a8042ea5da580e1fdb5b8f91c8bd7cad05cd6
> >> > >> Author: Petr Tesarik <petr@tesarici.cz>
> >> > >> Date: Wed Jan 11 22:53:01 2023 +0100
> >> > >>
> >> > >>     Cache: Calculate eprobe in reinit_entry()
> >> > >>
> >> > >>     If this function is called to reuse a ghost entry, the probe list
> >> > >>     has not been walked yet, so eprobe is left uninitialized.
> >> > >>
> >> > >>     This passed the test case, because the correct old value was left
> >> > >>     on stack. Modify the test case to poison the stack.
> >> > >>
> >> > >>     Signed-off-by: Petr Tesarik <petr@tesarici.cz>
> >> > >>
> >> > >>  src/kdumpfile/cache.c      |  6 +++++-
> >> > >>  src/kdumpfile/test-cache.c | 13 +++++++++++++
> >> > >>  2 files changed, 18 insertions(+), 1 deletion(-)
> >> > >
> >> > > This looks like a red herring to me. The cache most likely continues in
> >> > > a corrupted state without this commit, which may mask the issue (until
> >> > > it resurfaces later).
> >> >
> >> > I see, that makes a lot of sense.
> >> >
> >> > >> I haven't yet tried to debug the logic of the cache implementation and
> >> > >> create a patch. I'm totally willing to try that, but I figured I would
> >> > >> send this report to you first, to see if there's something obvious that
> >> > >> sticks out to your eyes.
> >> > >
> >> > > No, but I should be able to recreate the issue if I get a log of the
> >> > > cache API calls:
> >> > >
> >> > > - cache_alloc() - to know the number of elements
> >> > > - cache_get_entry()
> >> > > - cache_put_entry()
> >> > > - cache_insert()
> >> > > - cache_discard()
> >> > > - cache_flush() - not likely after initialization, but...
> >> >
> >> > I went ahead and logged each of these calls as you suggested, I tried to
> >> > log them at the beginning of the function call and always include the
> >> > cache pointer, cache_entry, and the key. I took the resulting log and
> >> > filtered it to just contain the most recently logged cache prior to the
> >> > crash, compressed it, and attached it. For completeness, the patch
> >> > I used is below (applies to tip branch 8254897 ("Merge pull request #78
> >> > from fweimer-rh/c99")).
> >> >
> >> > I'll also see if I can reproduce it based on the log.
> >>
> >> Thank you for the log. I haven't had much time to look at it, but the
> >> first line is a good hint already:
> >>
> >> 0x56098b68c4c0: cache_alloc(1024, 0)
> >>
> >> Zero size means the data pointers are managed by the caller, so this
> >> must be the cache of mmap()'ed segments. That's the only cache which
> >> installs a cleanup callback with set_cache_entry_cleanup(). There is
> >> only one call to the cleanup callback for evicted entries in cache.c:
> >>
> >> 	/* Get an unused cached entry. */
> >> 	if (cs->nuprobe != 0 &&
> >> 	    (cs->nuprec == 0 || cache->nprobe + bias > cache->dprobe))
> >> 		evict = evict_probe(cache, cs);
> >> 	else
> >> 		evict = evict_prec(cache, cs);
> >> 	if (cache->entry_cleanup)
> >> 		cache->entry_cleanup(cache->cleanup_data, evict);
> >>
> >> The entries can be evicted from the probe partition or from the precious
> >> partition. This might be relevant. Please, can you re-run and log where
> >> the evict entry comes from?
> >
> > I found some time this morning, and logging the evictions would not help. Because of a bug
> > in fcache_new(), the number of elements in the cache is big enough that
> > cache entries are never evicted in your case. It's quite weird to hit a
> > cache metadata bug after elements have been inserted. FWIW I am not
> > able to reproduce the bug by replaying the logged file read pattern.
> >
> > Since you have a reliable reproducer, it cannot be a Heisenbug. But it
> > could be caused by the other cache - the cache of decompressed pages.
> > Do you know for sure that lzo1x_decompress_safe() crashes while trying
> > to _read_ from the input buffer, and not while trying to _write_ to the
> > output buffer?
>
> Hi Petr,
>
> Sorry for the delay here, I got pulled into other issues and am trying
> to attend to all my work in a round-robin fashion :)
Hi Stephen,
that's fine, I also work on this only as time permits. ;-)
> The fault is definitely in lzo1x_decompress_safe() *writing* to address
> 0. I fetched debuginfo for all the necessary libraries and we see the
> following stack trace:
>
> %<-----------------------
> #0  0x00007fcd9adddef3 in lzo1x_decompress_safe (in=<optimized out>,
>     in_len=<optimized out>, out=0x0, out_len=0x7ffdee2c1388, wrkmem=<optimized out>)
>     at src/lzo1x_d.ch:120
> #1  0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
> #2  0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
>     fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
> #3  0x00007fcd9ae25e44 in diskdump_get_page (pio=0x7ffdee2c1590) at diskdump.c:647
> #4  0x00007fcd9ae32be0 in get_page (pio=0x7ffdee2c1590)
>     at /home/stepbren/repos/libkdumpfile/src/kdumpfile/kdumpfile-priv.h:1512
> #5  0x00007fcd9ae32ed4 in get_page_xlat (pio=0x7ffdee2c1590) at read.c:126
> #6  0x00007fcd9ae32f22 in get_page_maybe_xlat (pio=0x7ffdee2c1590) at read.c:137
> #7  0x00007fcd9ae32fb1 in _kdumpfile_priv_read_locked (ctx=0x55745bfca8f0,
>     as=KDUMP_KVADDR, addr=18446612133360081960, buffer=0x7ffdee2c17df,
>     plength=0x7ffdee2c1698) at read.c:169
> #8  0x00007fcd9ae330dd in kdump_read (ctx=0x55745bfca8f0, as=KDUMP_KVADDR,
>     addr=18446612133360081960, buffer=0x7ffdee2c17df, plength=0x7ffdee2c1698)
>     at read.c:196
> #9  0x00007fcd9afb0cc4 in drgn_read_kdump (buf=0x7ffdee2c17df,
>     address=18446612133360081960, count=4, offset=18446612133360081960,
>     arg=0x55745bfca8f0, physical=false) at ../../libdrgn/kdump.c:73
> %<-----------------------
>
> In frame 1 where we are calling the decompressor:
>
> %<-----------------------
> (gdb) frame 1
> #1 0x00007fcd9ae25be1 in diskdump_read_page (pio=0x7ffdee2c1590) at diskdump.c:584
> 584			int ret = lzo1x_decompress_safe(fch.data, pd.size,
> (gdb) list
> 579			if (ret != KDUMP_OK)
> 580				return ret;
> 581		} else if (pd.flags & DUMP_DH_COMPRESSED_LZO) {
> 582	#if USE_LZO
> 583			lzo_uint retlen = get_page_size(ctx);
> 584			int ret = lzo1x_decompress_safe(fch.data, pd.size,
> 585							pio->chunk.data,
> 586							&retlen,
> 587							LZO1X_MEM_DECOMPRESS);
> 588			fcache_put_chunk(&fch);
> (gdb) p retlen
> $7 = 0
This is a bit weird. Looking at liblzo sources, it seems to me that
the output length is not changed until right before returning from
lzo1x_decompress_safe().
> (gdb) p pio->chunk.data
> $8 = (void *) 0x0
OK, here's our immediate root cause. ;-)
> (gdb) p fch.data
> $9 = (void *) 0x7fcd7cc33da4
This looks sane.
> (gdb) p pd.size
> $10 = 816
This also looks sane.
> %<-----------------------
>
> As far as I can tell, pio->chunk.data comes directly from the
> cache_get_page() function in frame 2:
>
> %<-----------------------
> (gdb) up
> #2 0x00007fcd9ae32d4d in _kdumpfile_priv_cache_get_page (pio=0x7ffdee2c1590,
> fn=0x7fcd9ae257ae <diskdump_read_page>) at read.c:69
> 69 ret = fn(pio);
> (gdb) list
> 64		pio->chunk.data = entry->data;
> 65		pio->chunk.embed_fces->ce = entry;
> 66		if (cache_entry_valid(entry))
> 67			return KDUMP_OK;
> 68
> 69		ret = fn(pio);
> 70		mutex_lock(&ctx->shared->cache_lock);
> 71		if (ret == KDUMP_OK)
> 72			cache_insert(pio->chunk.embed_fces->cache, entry);
> 73		else
> (gdb) p *entry
> $11 = {key = 1045860353, state = cs_precious, next = 626, prev = 626, refcnt = 1,
> data = 0x0}
The key (0x3e569000 | ADDRXLAT_MACHPHYSADDR) corresponds to the
requested virtual address 0xffff88003e569c28.
> (gdb) p *pio
> $12 = {ctx = 0x55745bfca8f0, addr = {addr = 1045860352, as = ADDRXLAT_MACHPHYSADDR},
> chunk = {data = 0x0, nent = 1, {embed_fces = {{data = 0xffff880ff1470788,
> len = 140728599320032, ce = 0x55745c1003d8, cache = 0x55745c0fb540}, {
> data = 0x55745bfd42f0, len = 140728599320112,
> ce = 0x7fcd9ae330ef <kdump_read+102>, cache = 0xffff88003e569c28}},
> fces = 0xffff880ff1470788}}}
> %<-----------------------
Looking at pio->chunk.embed_fces->ce, struct cache_entry is at
0x55745c1003d8. Assuming that sizeof(struct cache_entry) == 32 on your
system, this is element 626 in the cache entry array. The next and
prev indices are also 626, which looks good, because cache->ninflight
is 1, so this is the only element in the (circular) in-flight list.
Since state is cs_precious, but the data was discarded, this cache
entry has just been recovered from a ghost partition, evicting another
entry, and _that_ entry had a NULL data pointer.
It would be really helpful if I could get the log for this cache
instead of the one you posted earlier.
>
> And here is the cache structure:
>
> %<-----------------------
> (gdb) p *pio->chunk.embed_fces->cache
> $16 = {split = 487, nprec = 1020, ngprec = 248, nprobe = 3, ngprobe = 239, dprobe = 2,
> cap = 1024, inflight = 626, ninflight = 1, hits = {number = 168473, address = 168473,
> string = 0x29219 <error: Cannot access memory at address 0x29219>,
> bitmap = 0x29219, blob = 0x29219}, misses = {number = 1913, address = 1913,
> string = 0x779 <error: Cannot access memory at address 0x779>, bitmap = 0x779,
> blob = 0x779}, elemsize = 4096, data = 0x7fcd997fe010, entry_cleanup = 0x0,
> cleanup_data = 0x0, ce = 0x55745c0fb598}
> %<-----------------------
BTW with a 98.9% hit ratio (168473 hits, 1913 misses), the cache proves
to be quite effective for your workload. ;-)
Petr T
> Thanks for looking into this! I'll continue investigating more as well.
>
> Stephen
>
Thread overview: 11+ messages
2024-01-05 18:38 Segmentation fault with drgn + libkdumpfile Stephen Brennan
2024-01-05 19:23 ` Petr Tesařík
2024-01-05 21:53 ` Stephen Brennan
2024-01-08 20:40 ` Petr Tesařík
2024-01-09 9:06 ` Petr Tesařík
2024-01-10 1:40 ` Stephen Brennan
2024-01-10 8:36 ` Petr Tesařík [this message]
2024-01-10 13:49 ` Petr Tesařík
2024-01-10 18:03 ` Petr Tesařík
2024-01-10 19:48 ` Stephen Brennan
2024-01-10 19:58 ` Petr Tesařík