* Exciting :-( adventures in metadata checksumming
@ 2012-08-03 19:55 George Spelvin
  2012-08-03 23:49 ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-03 19:55 UTC (permalink / raw)
  To: linux-ext4; +Cc: linux

(Search for "Not Good" to see the report of FILE SYSTEM CORRUPTION by
e2fsck 1.43-WIP (git 9f0dbd24f8af) with -O metadata_csum.)

I've been having some problems recently with a large ext4 RAID-6
array.  The FS suddenly switched to read-only after finding some
problems:

[635067.851004] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #98205884: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635067.851015] Aborting journal on device md0-8.
[635067.886082] EXT4-fs (md0): Remounting filesystem read-only
[635257.672659] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #98205885: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635274.620679] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478411: comm updatedb.mlocat: bad header/extent: invalid magic - magic 400a, entries 1, max 4(0), depth 0(0)
[635274.621006] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478403: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635274.693563] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478417: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635274.741286] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478407: comm updatedb.mlocat: bad header/extent: invalid magic - magic 20a, entries 1, max 4(0), depth 0(0)
[635274.741683] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478401: comm updatedb.mlocat: bad header/extent: invalid magic - magic 800a, entries 1, max 4(0), depth 0(0)
[635274.778130] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478415: comm updatedb.mlocat: bad header/extent: invalid magic - magic 630a, entries 1, max 4(0), depth 0(0)
[635274.785982] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478405: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635274.789177] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478419: comm updatedb.mlocat: bad header/extent: invalid magic - magic a, entries 1, max 4(0), depth 0(0)
[635274.791153] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478413: comm updatedb.mlocat: bad header/extent: invalid magic - magic c30a, entries 1, max 4(0), depth 0(0)
[635274.808709] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #133478409: comm updatedb.mlocat: bad header/extent: invalid magic - magic f00a, entries 1, max 4(0), depth 0(0)

I notice that the msbyte (the second byte) of the magic number appears to
be corrupted.  In particular, some subset of the bits that should be set
(the magic number is 0xf30a) has been cleared.
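
That observation can be checked mechanically.  The following is a minimal
Python sketch, not part of the original report: it takes the magic values
from the kernel log above and confirms that each one is 0xf30a with some
bits cleared, and never has a bit set that should be clear.  The names
EXPECTED_MAGIC, OBSERVED, cleared_bits and stray_bits are made up for
illustration.

```python
# Expected extent header magic, as stated in the report.
EXPECTED_MAGIC = 0xF30A

# Magic values copied from the "invalid magic" kernel messages above.
OBSERVED = [0x000A, 0x400A, 0x020A, 0x800A, 0x630A, 0xC30A, 0xF00A]


def cleared_bits(value, expected=EXPECTED_MAGIC):
    """Bits that are set in the expected magic but missing from value."""
    return expected & ~value


def stray_bits(value, expected=EXPECTED_MAGIC):
    """Bits that are set in value but should not be set at all."""
    return value & ~expected


for value in OBSERVED:
    print(f"{value:#06x}: cleared {cleared_bits(value):#06x}, "
          f"stray {stray_bits(value):#06x}")
```

Every value has stray bits of zero and an intact low byte of 0x0a, which
is what "bits cleared in the high byte only" looks like.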

I'd suspect memory corruption, but any hardware problem that would clear
*that* many bits would not have left the machine running for a month
at a time.  It's been stably running Ubuntu 12.04 LTS and providing a
Samba server for some months now.

The above errors were from the precompiled 3.2.0-26 Ubuntu kernel.

Anyway, after unmounting the file system and running e2fsck, I got a
large number of errors of the form

Extended attribute in inode 70975811 has a value size (0) which is invalid
Extended attribute in inode 70975820 has a value size (0) which is invalid
Extended attribute in inode 70975821 has a value size (0) which is invalid
Extended attribute in inode 70975822 has a value size (0) which is invalid
Extended attribute in inode 70975823 has a value size (0) which is invalid

however, the inode numbers affected do not overlap the set the kernel was
complaining about.

I let e2fsck fix those problems (they were almost all Thumbs.db files on
Samba directories, so I wasn't too worried) and remounted the file system.

Three days later, more of the same!

[1038734.464464] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395710: comm chmod: bad header/extent: invalid magic - magic 510a, entries 1, max 4(0), depth 0(0)
[1038734.464474] Aborting journal on device md0-8.
[1038734.518844] EXT4-fs (md0): Remounting filesystem read-only
[1038734.519809] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395702: comm chmod: bad header/extent: invalid magic - magic 730a, entries 1, max 4(0), depth 0(0)
[1038734.521094] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395703: comm chmod: bad header/extent: invalid magic - magic d30a, entries 1, max 4(0), depth 0(0)
[1038734.526998] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395712: comm chmod: bad header/extent: invalid magic - magic d10a, entries 1, max 4(0), depth 0(0)
[1038734.527912] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395704: comm chmod: bad header/extent: invalid magic - magic f10a, entries 1, max 4(0), depth 0(0)
[1038734.529935] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395706: comm chmod: bad header/extent: invalid magic - magic 510a, entries 1, max 4(0), depth 0(0)
[1038734.531899] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395709: comm chmod: bad header/extent: invalid magic - magic d10a, entries 1, max 4(0), depth 0(0)
[1038734.532839] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395700: comm chmod: bad header/extent: invalid magic - magic 710a, entries 1, max 4(0), depth 0(0)
[1038734.536454] EXT4-fs error (device md0): ext4_ext_check_inode:398: inode #143395711: comm chmod: bad header/extent: invalid magic - magic d10a, entries 1, max 4(0), depth 0(0)

Same e2fsck results.  But I'm getting concerned.

So I think that the (mostly) successful e2fsck results show that the
*disk* data appears to be valid, and some form of corruption appears
to be affecting the read data.  Metadata checksums should catch the
problem sooner.
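
To make the idea concrete, here is a rough Python sketch of how a checksum
would flag the kind of bit-clearing seen above.  It is an illustration
only: metadata_csum uses the crc32c algorithm, but the real ext4 checksums
are seeded with per-filesystem and per-inode values and cover whole
metadata structures, whereas this sketch checksums a hypothetical 8-byte
header (magic, entries, max, depth) by itself.  The helper names are
invented for the example.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def pack_header(magic: int) -> bytes:
    """Hypothetical little-endian header: magic, entries=1, max=4, depth=0."""
    return struct.pack("<HHHH", magic, 1, 4, 0)


# Checksum computed when the header was written with the correct magic.
stored = crc32c(pack_header(0xF30A))

# The same header read back with bits cleared in the high byte.
for damaged_magic in (0x510A, 0x730A, 0xD10A):
    mismatch = crc32c(pack_header(damaged_magic)) != stored
    print(f"{damaged_magic:#06x}: checksum mismatch = {mismatch}")
```

A mismatch shows up on the first read of the damaged data, rather than
only when some later sanity check happens to trip over it.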

So let me try that!

But look, it's not supported by the Ubuntu 3.2.0 kernel.  No problem,
Quantal Quetzal has a 3.5 kernel that I can install that's already
configured properly.

But damn, even Quantal only has e2fsprogs 1.42.4, which doesn't have -O
metadata_csum support.

git clone, compile... damn!  git master is 1.42.5, which doesn't have
it either!

But 1.43-WIP has it, so let me compile the "next" branch... success!

/root/e2fsprogs# misc/tune2fs -O metadata_csum /dev/md0
tune2fs 1.43-WIP (1-Aug-2012)

Please run e2fsck -D on the filesystem.
/root/e2fsprogs#

Oh, wait a minute... that better be the *new* e2fsck; the system one
doesn't have metadata_csum support!  I ought to install the new utilities
so that the system won't get stuck booting.  (Fortunately, the
RAID is not the root file system.)

A considerable amount of time trying to run "debuild -b -us -uc" and
"debian/rules binary" elapses.  I am unable to build a .deb.  Damn.
And debian/rules files are a complete maze of layers of helper
utilities that I have no idea how to debug. :-(

I'll have to just install local versions in /usr/local/sbin and ensure
they get used until Ubuntu catches up.

But in the meantime, let me run that e2fsck that's suggested...

/root/e2fsprogs# e2fsck/e2fsck -D -v -C0 /dev/md0
e2fsck 1.43-WIP (1-Aug-2012)
/dev/md0 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure                                           
Directory inode 1660934, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 3547141, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 80533520, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 100868100, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 103686159, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 107098118, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 107530256, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 112592908, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 114372621, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 119973900, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 120281096, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 122302465, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 124215315, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 127861088, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 131426306, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 133816331, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 140457985, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 141527045, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 142325769, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Directory inode 143130951, block #0, offset 4076: directory passes checks but fails checksum
Fix<y>? yes
Pass 3: Checking directory connectivity                                        
Pass 3A: Optimizing directories                                                
Pass 4: Checking reference counts                                              
Pass 5: Checking group summary information                                     
Free blocks count wrong for group #46928 (7000, counted=7001).                 
Fix<y>? yes
Free blocks count wrong for group #53136 (30333, counted=30334).
Fix<y>? yes
Free blocks count wrong (856654909, counted=856654911).
Fix<y>? yes
                                                                               
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

     1564799 inodes used (1.03%, out of 152619008)
        9199 non-contiguous files (0.6%)
         691 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 1559941/4838
  1585236769 blocks used (64.92%, out of 2441891680)
           0 bad blocks
         370 large files

      580404 regular files
      984376 directories
           0 character device files
           0 block device files
           0 fifos
     1655417 links
           9 symbolic links (9 fast symbolic links)
           1 socket
------------
     3220207 files
/root/e2fsprogs# 

I'm not sure what's supposed to happen, but those seem like reasonably
harmless errors that tune2fs might have left behind.

I'm also not sure why
/backuppc/pc/localhost/{66..80}/f%2f/fusr/fshare/fman/fman1
had some lingering checksum problems, and no other directories.

But let me run e2fsck once more, just to be safe...

Oh, shit!  This is Not Good!

/root/e2fsprogs# e2fsck/e2fsck -v -C0 /dev/md0
e2fsck 1.43-WIP (1-Aug-2012)
/dev/md0: clean, 1564799/152619008 files, 1585236769/2441891680 blocks
/root/e2fsprogs# e2fsck/e2fsck -f -v -C0 /dev/md0
e2fsck 1.43-WIP (1-Aug-2012)
Pass 1: Checking inodes, blocks, and sizes
Inode 96108844 has an invalid extent node (blk 1537738982, lblk 0)             
Clear<y>? no
HTREE directory inode 96108844 has an invalid root node.
Clear HTree index<y>? no
Inode 96108844 is a zero-length directory.  Clear<y>? no
Inode 96108844, i_size is 24576, should be 0.  Fix<y>? no
Inode 96108844, i_blocks is 56, should be 0.  Fix<y>? no
Inode 108822615 has an invalid extent node (blk 1741162561, lblk 0)            
Clear<y>? no
HTREE directory inode 108822615 has an invalid root node.
Clear HTree index<y>? no
Inode 108822615 is a zero-length directory.  Clear<y>? no
Inode 108822615, i_size is 24576, should be 0.  Fix<y>? no
Inode 108822615, i_blocks is 56, should be 0.  Fix<y>? no
Pass 2: Checking directory structure                                           
Pass 3: Checking directory connectivity                                        
'..' in /backuppc/pc/localhost/57/f%2f/fusr/flib/fi386-linux-gnu (96108844) is <The NULL inode> (0), should be /backuppc/pc/localhost/57/f%2f/fusr/flib (96043127).
Fix<y>? no
Unconnected directory inode 96108845 (???)
Connect to /lost+found<y>? no
'..' in ... (96108845) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108846 (???)
Connect to /lost+found<y>? no
'..' in ... (96108846) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108847 (???)
Connect to /lost+found<y>? no
'..' in ... (96108847) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108848 (???)
Connect to /lost+found<y>? no
'..' in ... (96108848) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108850 (???)
Connect to /lost+found<y>? no
'..' in ... (96108850) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108852 (???)
Connect to /lost+found<y>? no
'..' in ... (96108852) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? no
Unconnected directory inode 96108853 (???)
Connect to /lost+found<y>? no
'..' in ... (96108853) is ??? (96108844), should be <The NULL inode> (0).
Fix<y>? 
/dev/md0: e2fsck canceled.

/dev/md0: ********** WARNING: Filesystem still has errors **********

/root/e2fsprogs# 


Eek!  Doubleplusungood.  Repeating the e2fsck, the errors seem to be
consistent.  For the record I did *nothing* to the file system between
the two runs except used debugfs in read-only mode to ncheck the inodes
that generated complaints.

In fact, I ran debugfs *during* the e2fsck directory optimize pass, and
it complained about some corruption, but I figured that was just e2fsck
at work, and debugfs (without -w) couldn't affect the e2fsck in any way.

If I let e2fsck fix the problems (fortunately, the trashed directory is
just an old backup), they appear to actually go away; another run comes
up clean.


For now, I have e2fsck manually installed, while I work on compiling a
kernel with Ted's ext4_for_linus fixes, especially that directory rename
checksum issue.


I'm not quite sure what's going on (I still haven't figured out the
original corruption problem), but I figured this was worth sharing anyway.

I can compile a kernel without error, so my RAM can't be in *that* bad
a shape.  (Ted, at least, will remember that from the days of 0.99pl13j.)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-03 19:55 Exciting :-( adventures in metadata checksumming George Spelvin
@ 2012-08-03 23:49 ` Theodore Ts'o
  2012-08-04  1:42   ` George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2012-08-03 23:49 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Fri, Aug 03, 2012 at 03:55:08PM -0400, George Spelvin wrote:
> A considerable amount of time trying to run "debuild -b -us -uc" and
> "debian/rules binary" elapses.  I am unable to build a .deb.  Damn.
> And debian/rules files are a complete maze of layers of helper
> utilities that I have no idea how to debug. :-(

This is what I normally do when I build debian packages.  I normally
will create a tarball using the gen-tarball script in the util
directory (which is a generated file, so that means you need to run
"configure ; sh -vx util/gen-tarball" if you are using a freshly
checked out git tree).  In theory you should be able to do a debian
build out of the git tree, but it's not what I normally do....

So normally, I do something like this:

cp build/util/e2fsprogs-1.42.5.tar.gz /u1/debian-bld/e2fsprogs_1.42.5.orig.tar.gz
#
# optional "schroot -c sid" if you're building for unstable in a debian chroot
# goes here
#
cd /u1/debian-bld
tar xvfz e2fsprogs_1.42.5.orig.tar.gz
cd e2fsprogs-1.42.5
./debian/rules
dpkg-buildpackage -rfakeroot


The "./debian/rules" line generates various files in debian based on
your build environment.  It's there so that the build tree can work on
older versions of Debian/Ubuntu LTS as well as the very latest
versions of debian.  (i.e., it figures out whether or not we need to
build the uuid and blkid libraries, or whether we are on a system
where the responsibility for those libraries has been moved to
util-linux-ng.  It figures out whether you are on a new enough version
of Debian/Ubuntu to require the new multi-arch support.)

> Eek!  Doubleplusungood.  Repeating the e2fsck, the errors seem to be
> consistent.  For the record I did *nothing* to the file system between
> the two runs except used debugfs in read-only mode to ncheck the inodes
> that generated complaints.

Hmm... I can't replicate the problem using a cleanly created file
system, copying a huge number of files to it, and then enabling
metadata_csum using tune2fs, and then running e2fsck -f on the device
again.

The fact that you were seeing multiple cases of file system
corruption before you started using metadata_csum makes me very
suspicious, though.  I'm not sure whether you have a hardware problem,
or a bug in the md layer, or something else but the fact you were
seeing what looks like metadata corruption problems even before
turning on metadata_csum doesn't make it surprising that you might be
having the checksum failures reported!

							- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-03 23:49 ` Theodore Ts'o
@ 2012-08-04  1:42   ` George Spelvin
  2012-08-04 22:12     ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-04  1:42 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> This is what I normally do when I build debian packages.  I normally
> will create a tarball using the gen-tarball script in the util
> directory (which is a generated file, so that means you need to run
> "configure ; sh -vx util/gen-tarball" if you are using a freshly
> checked out git tree.  In theory you should be able to do a debian
> build out of the git tree, but it's not what I normally do....

Thanks for the info.  That's what I tried.  I also used "git archive"
to make the tarball.

I'll try it your way.

Lesson 1: gen-tarball must be run from the "util" directory, because it
tars up ".."; if you run it from the git root as shown above, it tars
up entirely too much!

Anyway, it appeared to work, but halted with one of the same errors I
encountered before:

gcc -c -I. -I../lib -I/tmp/build/e2fsprogs-1.43/lib -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -D__NO_STRING_INLINES /tmp/build/e2fsprogs-1.43/e2fsck/sigcatcher.c -o sigcatcher.o
gcc -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-rpath-link,../lib -rdynamic -o e2fsck dict.o unix.o e2fsck.o super.o pass1.o pass1b.o pass2.o pass3.o pass4.o pass5.o journal.o badblocks.o util.o dirinfo.o dx_dirinfo.o ehandler.o problem.o message.o quota.o recovery.o region.o revoke.o ea_refcount.o rehash.o profile.o prof_err.o logfile.o sigcatcher.o  ../lib/libquota.a ../lib/libext2fs.so ../lib/libcom_err.so  -lblkid    -luuid     ../lib/libe2p.so 
../lib/libcom_err.so: undefined reference to `sem_post'
../lib/libcom_err.so: undefined reference to `sem_wait'
../lib/libcom_err.so: undefined reference to `sem_init'
../lib/libcom_err.so: undefined reference to `sem_destroy'
collect2: error: ld returned 1 exit status
make[3]: *** [e2fsck] Error 1
make[3]: Leaving directory `/tmp/build/e2fsprogs-1.43/debian/BUILD-STD/e2fsck'
make[2]: *** [all-progs-recursive] Error 1
make[2]: Leaving directory `/tmp/build/e2fsprogs-1.43/debian/BUILD-STD'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/tmp/build/e2fsprogs-1.43/debian/BUILD-STD'
make: *** [debian/stampdir/build-std-stamp] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2

> Hmm... I can't replicate the problem using a cleanly created file
> system, copying a huge number of files to it, and then enabling
> metadata_csum using tune2fs, and then running e2fsck -f on the device
> again.

The corruption was on a backuppc directory, so if you're so inclined,
do a lot of hard-linking with "cp -l" as well.

There are 3220155 names in the file system, but only 1.5M inodes:
Filesystem        Inodes   IUsed     IFree IUse% Mounted on
/dev/md0       152619008 1565807 151053201    2% /data

What I *now* just realized is that, had my brain been in gear,
I should have run e2image on the file system *before* repairing it
for real.  That would have been highly informative.

I'm very very sorry.

> The fact that you are were seeing multiple cases of file system
> corruption before you started using metadata_csum makes me very
> suspicious, though.  I'm not sure whether you have a hardware problem,
> or a bug in the md layer, or something else but the fact you were
> seeing what looks like metadata corruption problems even before
> turning on metadata_csum doesn't make it surprising that you might be
> having the checksum failures reported!

Yes, I'm not sure what's going on, either.  updatedb found the problems
as it traversed the FS, but it does that *every* night, and literally
Nothing Happened the night of the failure.

It's also an oddly patterned and elusive error, with bits being
cleared in the high byte of the magic number, and then reappearing
when e2fsck looks at them.

One part of me thinks it's *got* to be a RAM problem, but I'd think
parallel kernel compiles and "git fsck" would catch that.  I've also been
running updatedb manually, since that's what triggered last time.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-04  1:42   ` George Spelvin
@ 2012-08-04 22:12     ` Theodore Ts'o
  2012-08-04 22:41       ` George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2012-08-04 22:12 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Fri, Aug 03, 2012 at 09:42:39PM -0400, George Spelvin wrote:
> ../lib/libcom_err.so: undefined reference to `sem_post'

Hmm... I have no idea why this is working for me, but I do see the
problem.  I suspect we must be using different versions of libc, or
the linker, and my environment is much more forgiving, but the problem
is that we're linking -lpthread in the wrong order when creating
libcom_err.so; somehow it's not breaking when I do my compiles
though.
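
To illustrate why the position of -lpthread can matter, here is a toy
Python model of a linker that walks its arguments left to right and only
records a shared library as needed if it resolves a symbol that is
undefined at that point.  This roughly mimics GNU ld with --as-needed;
that the failing toolchain behaves this way is an assumption, not
something established in this thread.  The object name and symbol sets
are hypothetical.

```python
def link(args, objects, libraries):
    """Toy one-pass link.

    objects maps an object file name to the symbols it leaves undefined.
    libraries maps a -l flag to the symbols that library defines.
    Returns (libraries recorded as needed, symbols still undefined).
    """
    undefined = set()
    needed = []
    for arg in args:
        if arg in objects:
            # An object file adds its unresolved references.
            undefined |= objects[arg]
        elif arg in libraries:
            # A library is kept only if it satisfies something seen so far.
            provided = libraries[arg] & undefined
            if provided:
                needed.append(arg)
                undefined -= provided
    return needed, undefined


SEM_SYMBOLS = {"sem_init", "sem_wait", "sem_post", "sem_destroy"}
OBJECTS = {"com_err_objs.o": set(SEM_SYMBOLS)}   # hypothetical object file
LIBRARIES = {"-lpthread": set(SEM_SYMBOLS)}

# Library listed before the object files: nothing is undefined yet.
lib_first = link(["-lpthread", "com_err_objs.o"], OBJECTS, LIBRARIES)

# Library listed after the object files: it resolves the sem_* references.
lib_last = link(["com_err_objs.o", "-lpthread"], OBJECTS, LIBRARIES)

print("library first:", lib_first)
print("library last: ", lib_last)
```

In the library-first case the shared library is produced without a
dependency on libpthread, and the unresolved sem_* references only
surface later, when a program is linked against it.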

The following patch should fix things.  I got a bug report which
pointed out the problem just recently.

					- Ted

commit 037b728b8a6a775e9a5e03fd24b1008d633c1cb4
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Aug 4 16:56:55 2012 -0400

    Put ELF_OTHER_LIBS in the right place for the linker
    
    Commit a7c17431b9 attempted to fix a problem where the system
    libraries might get used instead of local libraries for things like
    -lcom_err.  It tried to accomplish this by moving $(ELF_OTHER_LIBS) to
    before $(LDFLAGS).
    
    Unfortunately, this was the wrong fix; $(ELF_OTHER_LIBS) *MUST* be
    after the object files, or the linker might not pull in the necessary
    library and not include it into the DT_NEEDED section of the shared
    library.  The proper fix is to add a -L$(LIB) before $(LDFLAGS), and
    then remove the -L option from all of the ELF_OTHER_LIBS definitions
    in the library Makefiles.
    
    Addresses-Sourceforge-Bug: #3554345
    
    Cc: Olivier Blin <olivier.blin@softathome.com>
    Reported-by:  Mike Frysinger <vapier@gentoo.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/lib/Makefile.elf-lib b/lib/Makefile.elf-lib
index c66281c..6524df5 100644
--- a/lib/Makefile.elf-lib
+++ b/lib/Makefile.elf-lib
@@ -24,8 +24,8 @@ image:		$(ELF_LIB)
 
 $(ELF_LIB): $(OBJS)
 	$(E) "	GEN_ELF_SOLIB $(ELF_LIB)"
-	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) $(ELF_OTHER_LIBS) \
-		$(LDFLAGS) -Wl,-soname,$(ELF_SONAME) $(OBJS))
+	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) -L$(LIB) $(LDFLAGS) \
+		-Wl,-soname,$(ELF_SONAME) $(OBJS) $(ELF_OTHER_LIBS))
 	$(Q) $(MV) elfshared/$(ELF_LIB) .
 	$(Q) $(RM) -f ../$(ELF_LIB) ../$(ELF_IMAGE).so ../$(ELF_SONAME)
 	$(Q) (cd ..; $(LN) $(LINK_BUILD_FLAGS) \
diff --git a/lib/Makefile.solaris-lib b/lib/Makefile.solaris-lib
index 66f2b4c..2655ed8 100644
--- a/lib/Makefile.solaris-lib
+++ b/lib/Makefile.solaris-lib
@@ -24,8 +24,8 @@ image:		$(ELF_LIB)
 
 $(ELF_LIB): $(OBJS)
 	$(E) "	GEN_ELF_SOLIB $(ELF_LIB)"
-	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) $(ELF_OTHER_LIBS) \
-		$(LDFLAGS) -Wl,-h,$(ELF_SONAME) $(OBJS))
+	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) -L$(LIB) $(LDFLAGS) \
+		-Wl,-h,$(ELF_SONAME) $(OBJS) $(ELF_OTHER_LIBS))
 	$(Q) $(MV) elfshared/$(ELF_LIB) .
 	$(Q) $(RM) -f ../$(ELF_LIB) ../$(ELF_IMAGE).so ../$(ELF_SONAME)
 	$(Q) (cd ..; $(LN) $(LINK_BUILD_FLAGS) \
diff --git a/lib/blkid/Makefile.in b/lib/blkid/Makefile.in
index f23a137..0ec8564 100644
--- a/lib/blkid/Makefile.in
+++ b/lib/blkid/Makefile.in
@@ -36,7 +36,7 @@ ELF_SO_VERSION = 1
 ELF_IMAGE = libblkid
 ELF_MYDIR = blkid
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -luuid
+ELF_OTHER_LIBS = -luuid
 
 BSDLIB_VERSION = 2.0
 BSDLIB_IMAGE = libblkid
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 0d9ac21..fc196fb 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -180,7 +180,7 @@ ELF_SO_VERSION = 2
 ELF_IMAGE = libext2fs
 ELF_MYDIR = ext2fs
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -lcom_err
+ELF_OTHER_LIBS = -lcom_err
 
 BSDLIB_VERSION = 2.1
 BSDLIB_IMAGE = libext2fs
diff --git a/lib/quota/Makefile.in b/lib/quota/Makefile.in
index 2851eac..720befd 100644
--- a/lib/quota/Makefile.in
+++ b/lib/quota/Makefile.in
@@ -31,7 +31,7 @@ LIBDIR= quota
 #ELF_IMAGE = libquota
 #ELF_MYDIR = quota
 #ELF_INSTALL_DIR = $(root_libdir)
-#ELF_OTHER_LIBS = -L../.. -lext2fs
+#ELF_OTHER_LIBS = -lext2fs
 
 #BSDLIB_VERSION = 1.0
 #BSDLIB_IMAGE = libquota
diff --git a/lib/ss/Makefile.in b/lib/ss/Makefile.in
index 19413cc..c396f2d 100644
--- a/lib/ss/Makefile.in
+++ b/lib/ss/Makefile.in
@@ -20,7 +20,7 @@ ELF_SO_VERSION = 2
 ELF_IMAGE = libss
 ELF_MYDIR = ss
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -lcom_err $(DLOPEN_LIB)
+ELF_OTHER_LIBS = -lcom_err $(DLOPEN_LIB)
 
 BSDLIB_VERSION = 1.0
 BSDLIB_IMAGE = libss

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-04 22:12     ` Theodore Ts'o
@ 2012-08-04 22:41       ` George Spelvin
  2012-08-06 16:47         ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-04 22:41 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> commit 037b728b8a6a775e9a5e03fd24b1008d633c1cb4
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Sat Aug 4 16:56:55 2012 -0400
>
>     Put ELF_OTHER_LIBS in the right place for the linker

Thanks for the update.  That produces (following your procedure from
the previous e-mail exactly, modulo directory names) a *different* error...

[...snippage...]
gcc -I. -I../../lib -I/tmp/e2fsprogs-1.43/lib -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -D__NO_STRING_INLINES -c /tmp/e2fsprogs-1.43/lib/ss/get_readline.c
gcc -I. -I../../lib -I/tmp/e2fsprogs-1.43/lib -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -D__NO_STRING_INLINES -DSHARED_ELF_LIB -fPIC -o elfshared/get_readline.o -c /tmp/e2fsprogs-1.43/lib/ss/get_readline.c
(if test -r libss.a; then /bin/rm -f libss.a.bak && /bin/mv libss.a libss.a.bak; fi)
ar rc libss.a ss_err.o std_rqs.o invocation.o help.o execute_cmd.o listen.o parse.o error.o prompt.o request_tbl.o list_rqs.o pager.o requests.o data.o get_readline.o
/bin/rm -f ../libss.a
(cd ..; /bin/ln  \
                `echo lib/ss | sed -e 's;lib/;;'`/libss.a libss.a)
(cd elfshared; gcc --shared -o libss.so.2.0 -L../../lib -Wl,-Bsymbolic-functions -Wl,-z,relro \
                -Wl,-soname,libss.so.2 ss_err.o std_rqs.o invocation.o help.o execute_cmd.o listen.o parse.o error.o prompt.o request_tbl.o list_rqs.o pager.o requests.o data.o get_readline.o -lcom_err -ldl)
/usr/bin/ld: cannot find -lcom_err
collect2: error: ld returned 1 exit status
make[3]: *** [libss.so.2.0] Error 1
make[3]: Leaving directory `/tmp/e2fsprogs-1.43/debian/BUILD-STD/lib/ss'
make[2]: *** [all-libs-recursive] Error 1
make[2]: Leaving directory `/tmp/e2fsprogs-1.43/debian/BUILD-STD'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/tmp/e2fsprogs-1.43/debian/BUILD-STD'
make: *** [debian/stampdir/build-std-stamp] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2


libcom_err certainly exists if you look in the right places:

/tmp/e2fsprogs-1.43$ find debian -name \*com_err\* 
debian/BUILD-STD/lib/libcom_err.so.2.1
debian/BUILD-STD/lib/libcom_err.so.2
debian/BUILD-STD/lib/libcom_err.a
debian/BUILD-STD/lib/et/libcom_err.so.2.1
debian/BUILD-STD/lib/et/elfshared/com_err.o
debian/BUILD-STD/lib/et/com_err.pc
debian/BUILD-STD/lib/et/com_err.o
debian/BUILD-STD/lib/et/libcom_err.a
debian/BUILD-STD/lib/libcom_err.so

The ld was running in debian/BUILD-STD/lib/ss/elfshared, so -L../../lib
is debian/BUILD-STD/lib/lib, which does not exist.  Is that right?
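
The path arithmetic can be double-checked with a few lines of Python.
This is just a sketch using the directory names from the build log and
the find output above; the helper name resolve is made up.

```python
import posixpath


def resolve(cwd, relative):
    """Resolve a relative -L argument against the linker's working directory."""
    return posixpath.normpath(posixpath.join(cwd, relative))


# Directory the link command runs in (after "cd elfshared").
LINK_CWD = "debian/BUILD-STD/lib/ss/elfshared"

# Where find located the freshly built shared library.
BUILT_LIB = "debian/BUILD-STD/lib/libcom_err.so"

searched_new = resolve(LINK_CWD, "../../lib")   # -L from the patched Makefile
searched_old = resolve(LINK_CWD, "../..")       # -L from the old ELF_OTHER_LIBS

print("patched -L resolves to:", searched_new)
print("old -L resolved to:    ", searched_old)
print("library lives in:      ", posixpath.dirname(BUILT_LIB))
```

The old "-L../.." pointed at the directory that actually holds
libcom_err.so; the patched "-L../../lib" points one level too deep.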

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-04 22:41       ` George Spelvin
@ 2012-08-06 16:47         ` Theodore Ts'o
  2012-08-06 18:14           ` George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2012-08-06 16:47 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Sat, Aug 04, 2012 at 06:41:39PM -0400, George Spelvin wrote:
> > commit 037b728b8a6a775e9a5e03fd24b1008d633c1cb4
> > Author: Theodore Ts'o <tytso@mit.edu>
> > Date:   Sat Aug 4 16:56:55 2012 -0400
> >
> >     Put ELF_OTHER_LIBS in the right place for the linker
> 
> Thanks for the update.  That produces (following your procedure from
> the previous e-mail exactly, modulo directory names) a *different* error...

Oops.  Sorry, I screwed up that last patch, because I forgot that we
were in the elfshared directory.  Since I had comerr-dev, et al.,
installed, I didn't notice because I was just linking against the
system libraries.  Thanks for testing and pointing this out!

Here's a replacement patch which should work; I've tested it after
deleting comerr-dev, e2fslibs-dev, etc., and it works, so I'm pretty
confident I got it right this time.

Regards,

						- Ted

commit d5aa6a82b37a0e78d8882601e6ad9da9d9dcb4da
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Aug 4 16:56:55 2012 -0400

    Put ELF_OTHER_LIBS in the right place for the linker
    
    Commit a7c17431b9 attempted to fix a problem where the system
    libraries might get used instead of local libraries for things like
    -lcom_err.  It tried to accomplish this by moving $(ELF_OTHER_LIBS) to
    before $(LDFLAGS).
    
    Unfortunately, this was the wrong fix; $(ELF_OTHER_LIBS) *MUST* be
    after the object files, or the linker might not pull in the necessary
    library and not include it into the DT_NEEDED section of the shared
    library.  The proper fix is to add a -L$(LIB) before $(LDFLAGS), and
    then remove the -L option from all of the ELF_OTHER_LIBS definitions
    in the library Makefiles.
    
    Addresses-Sourceforge-Bug: #3554345
    
    Cc: Olivier Blin <olivier.blin@softathome.com>
    Reported-by:  Mike Frysinger <vapier@gentoo.org>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/lib/Makefile.elf-lib b/lib/Makefile.elf-lib
index c66281c..78479d3 100644
--- a/lib/Makefile.elf-lib
+++ b/lib/Makefile.elf-lib
@@ -24,8 +24,9 @@ image:		$(ELF_LIB)
 
 $(ELF_LIB): $(OBJS)
 	$(E) "	GEN_ELF_SOLIB $(ELF_LIB)"
-	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) $(ELF_OTHER_LIBS) \
-		$(LDFLAGS) -Wl,-soname,$(ELF_SONAME) $(OBJS))
+	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) \
+		-L$(top_builddir)/../lib $(LDFLAGS) \
+		-Wl,-soname,$(ELF_SONAME) $(OBJS) $(ELF_OTHER_LIBS))
 	$(Q) $(MV) elfshared/$(ELF_LIB) .
 	$(Q) $(RM) -f ../$(ELF_LIB) ../$(ELF_IMAGE).so ../$(ELF_SONAME)
 	$(Q) (cd ..; $(LN) $(LINK_BUILD_FLAGS) \
diff --git a/lib/Makefile.solaris-lib b/lib/Makefile.solaris-lib
index 66f2b4c..5990be8 100644
--- a/lib/Makefile.solaris-lib
+++ b/lib/Makefile.solaris-lib
@@ -24,8 +24,9 @@ image:		$(ELF_LIB)
 
 $(ELF_LIB): $(OBJS)
 	$(E) "	GEN_ELF_SOLIB $(ELF_LIB)"
-	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) $(ELF_OTHER_LIBS) \
-		$(LDFLAGS) -Wl,-h,$(ELF_SONAME) $(OBJS))
+	$(Q) (cd elfshared; $(CC) --shared -o $(ELF_LIB) \
+		-L$(top_builddir)/../lib $(LDFLAGS) \
+		-Wl,-h,$(ELF_SONAME) $(OBJS) $(ELF_OTHER_LIBS))
 	$(Q) $(MV) elfshared/$(ELF_LIB) .
 	$(Q) $(RM) -f ../$(ELF_LIB) ../$(ELF_IMAGE).so ../$(ELF_SONAME)
 	$(Q) (cd ..; $(LN) $(LINK_BUILD_FLAGS) \
diff --git a/lib/blkid/Makefile.in b/lib/blkid/Makefile.in
index f23a137..0ec8564 100644
--- a/lib/blkid/Makefile.in
+++ b/lib/blkid/Makefile.in
@@ -36,7 +36,7 @@ ELF_SO_VERSION = 1
 ELF_IMAGE = libblkid
 ELF_MYDIR = blkid
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -luuid
+ELF_OTHER_LIBS = -luuid
 
 BSDLIB_VERSION = 2.0
 BSDLIB_IMAGE = libblkid
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index 0d9ac21..fc196fb 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -180,7 +180,7 @@ ELF_SO_VERSION = 2
 ELF_IMAGE = libext2fs
 ELF_MYDIR = ext2fs
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -lcom_err
+ELF_OTHER_LIBS = -lcom_err
 
 BSDLIB_VERSION = 2.1
 BSDLIB_IMAGE = libext2fs
diff --git a/lib/quota/Makefile.in b/lib/quota/Makefile.in
index 2851eac..720befd 100644
--- a/lib/quota/Makefile.in
+++ b/lib/quota/Makefile.in
@@ -31,7 +31,7 @@ LIBDIR= quota
 #ELF_IMAGE = libquota
 #ELF_MYDIR = quota
 #ELF_INSTALL_DIR = $(root_libdir)
-#ELF_OTHER_LIBS = -L../.. -lext2fs
+#ELF_OTHER_LIBS = -lext2fs
 
 #BSDLIB_VERSION = 1.0
 #BSDLIB_IMAGE = libquota
diff --git a/lib/ss/Makefile.in b/lib/ss/Makefile.in
index 19413cc..c396f2d 100644
--- a/lib/ss/Makefile.in
+++ b/lib/ss/Makefile.in
@@ -20,7 +20,7 @@ ELF_SO_VERSION = 2
 ELF_IMAGE = libss
 ELF_MYDIR = ss
 ELF_INSTALL_DIR = $(root_libdir)
-ELF_OTHER_LIBS = -L../.. -lcom_err $(DLOPEN_LIB)
+ELF_OTHER_LIBS = -lcom_err $(DLOPEN_LIB)
 
 BSDLIB_VERSION = 1.0
 BSDLIB_IMAGE = libss
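
The ordering requirement in the commit message above can be illustrated with a toy model. This is only a sketch, assuming the linker records a shared library in DT_NEEDED only when it satisfies a symbol that is already undefined at that point on the command line (as with GNU ld's `--as-needed`). The `link` function, the object name, and the symbol names are hypothetical, and real linkers are considerably more involved.

```python
def link(inputs):
    """Toy left-to-right link pass.

    Each input is (kind, name, defined_symbols, referenced_symbols),
    where kind is "obj" for an object file or "lib" for a shared library.
    Returns (libraries recorded as needed, symbols left undefined).
    """
    defined = set()
    undefined = set()
    needed = []
    for kind, name, defs, refs in inputs:
        if kind == "obj":
            # Object files are always linked in.
            defined |= defs
            undefined -= defs
            undefined |= refs - defined
        else:
            # A library is recorded only if it resolves something
            # that is undefined *right now*.
            satisfied = undefined & defs
            if satisfied:
                needed.append(name)
                defined |= defs
                undefined -= satisfied
    return needed, undefined


# Hypothetical object that calls com_err(), and the library providing it.
obj = ("obj", "ext2fs.o", {"ext2fs_open"}, {"com_err"})
lib = ("lib", "libcom_err", {"com_err"}, set())

print(link([obj, lib]))  # library after the objects
print(link([lib, obj]))  # library before the objects
```

In this model the library is recorded only in the first ordering; in the second it is skipped because nothing is undefined yet when it is seen, which mirrors the reason $(ELF_OTHER_LIBS) has to follow $(OBJS).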

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-06 16:47         ` Theodore Ts'o
@ 2012-08-06 18:14           ` George Spelvin
  2012-08-06 22:12             ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-06 18:14 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> Here's a replacement patch which should work; I've tested it after
> deleting comerr-dev, e2fslibs-dev, etc., and it works, so I'm pretty
> confident I got it right this time.

Thanks!  Successful build on Ubuntu 12.04.
(No patch required on Debian/unstable.)

(Well, some problems with multiarch and not wanting to *install* without
a new libcomerr2:i386 to match the new libcomerr2:amd64, but that's
something I have to figure out.)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-06 18:14           ` George Spelvin
@ 2012-08-06 22:12             ` Theodore Ts'o
  2012-08-06 22:59               ` George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2012-08-06 22:12 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Mon, Aug 06, 2012 at 02:14:41PM -0400, George Spelvin wrote:
> 
> (Well, some problems with multiarch and not wanting to *install* without
> a new libcomerr2:i386 to match the new libcomerr2:amd64, but that's
> something I have to figure out.)

Obvious stupid question --- you don't have a 32-bit krb5 package or
something else which depends on libcomerr2:i386?

I'm using Debian testing and haven't noted any issues like this...

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-06 22:12             ` Theodore Ts'o
@ 2012-08-06 22:59               ` George Spelvin
  2012-08-06 23:25                 ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-06 22:59 UTC (permalink / raw)
  To: linux, tytso; +Cc: linux-ext4

> Obvious stupid question --- you don't have a 32-bit krb5 package or
> something else which depends on libcomerr2:i386?

Yes, exactly.  Just deleting the damn thing was the first thing I
thought of.  The dependency chain is long and annoying and ends in
something that's not 64-bit safe and so depends on ia32-libs.

That's a catch-all package; I don't know if the dependency is real
or not.  I worked around it in the simple stupid way by editing
/var/lib/dpkg/status and telling it that the installed version is
1.43~WIP-2012-08-02-1; it's not like there's a meaningful difference.

> I'm using Debian testing and haven't noted any issues like this...

It's the Ubuntu machine.  Not your problem!

If you care, it ends up dying with the following, but I don't expect or need
your help.

i686-linux-gnu-gcc -c -I. -I../lib -I/tmp/e2fsprogs/lib -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -D__NO_STRING_INLINES /root/e2fsprogs/e2fsck/sigcatcher.c -o sigcatcher.o
i686-linux-gnu-gcc -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-rpath-link,../lib -rdynamic -o e2fsck dict.o unix.o e2fsck.o super.o pass1.o pass1b.o pass2.o pass3.o pass4.o pass5.o journal.o badblocks.o util.o dirinfo.o dx_dirinfo.o ehandler.o problem.o message.o quota.o recovery.o region.o revoke.o ea_refcount.o rehash.o profile.o prof_err.o logfile.o sigcatcher.o  ../lib/libquota.a ../lib/libext2fs.so ../lib/libcom_err.so     ../lib/libe2p.so 
unix.o: In function `PRS':
/tmp/e2fsprogs/e2fsck/unix.c:765: undefined reference to `blkid_get_cache'
/tmp/e2fsprogs/e2fsck/unix.c:857: undefined reference to `blkid_get_devname'
/tmp/e2fsprogs/e2fsck/unix.c:931: undefined reference to `blkid_get_devname'
e2fsck.o: In function `e2fsck_free_context':
/tmp/e2fsprogs/e2fsck/e2fsck.c:179: undefined reference to `blkid_put_cache'
super.o: In function `check_super_block':
/tmp/e2fsprogs/e2fsck/super.c:728: undefined reference to `uuid_is_null'
/tmp/e2fsprogs/e2fsck/super.c:730: undefined reference to `uuid_generate'
journal.o: In function `e2fsck_journal_load':
/tmp/e2fsprogs/e2fsck/journal.c:581: undefined reference to `uuid_is_null'
journal.o: In function `e2fsck_get_journal':
/tmp/e2fsprogs/e2fsck/journal.c:315: undefined reference to `uuid_is_null'
/tmp/e2fsprogs/e2fsck/journal.c:397: undefined reference to `uuid_unparse'
/tmp/e2fsprogs/e2fsck/journal.c:398: undefined reference to `blkid_get_devname'
/tmp/e2fsprogs/e2fsck/journal.c:401: undefined reference to `blkid_devno_to_devname'
journal.o: In function `e2fsck_check_ext3_journal':
/tmp/e2fsprogs/e2fsck/journal.c:766: undefined reference to `uuid_is_null'
journal.o: In function `e2fsck_journal_reset_super':
/tmp/e2fsprogs/e2fsck/journal.c:683: undefined reference to `uuid_generate'
journal.o: In function `e2fsck_fix_ext3_journal_hint':
/tmp/e2fsprogs/e2fsck/journal.c:1117: undefined reference to `uuid_is_null'
/tmp/e2fsprogs/e2fsck/journal.c:1120: undefined reference to `uuid_unparse'
/tmp/e2fsprogs/e2fsck/journal.c:1121: undefined reference to `blkid_get_devname'
dirinfo.o: In function `setup_tdb':
/tmp/e2fsprogs/e2fsck/dirinfo.c:63: undefined reference to `uuid_unparse'
collect2: error: ld returned 1 exit status
make[3]: *** [e2fsck] Error 1
make[3]: Leaving directory `/tmp/e2fsprogs/debian/BUILD-STD/e2fsck'
make[2]: *** [all-progs-recursive] Error 1
make[2]: Leaving directory `/tmp/e2fsprogs/debian/BUILD-STD'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/tmp/e2fsprogs/debian/BUILD-STD'
make: *** [debian/stampdir/build-std-stamp] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2
debuild: fatal error at line 1350:
dpkg-buildpackage -rfakeroot -d -us -uc -b -ai386 failed


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-06 22:59               ` George Spelvin
@ 2012-08-06 23:25                 ` Theodore Ts'o
  2012-08-08 13:39                   ` metadata_csum Oops George Spelvin
  2012-08-08 22:34                   ` Exciting :-( adventures in metadata checksumming George Spelvin
  0 siblings, 2 replies; 15+ messages in thread
From: Theodore Ts'o @ 2012-08-06 23:25 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-ext4

On Mon, Aug 06, 2012 at 06:59:37PM -0400, George Spelvin wrote:
> dpkg-buildpackage: error: debian/rules build gave error exit status 2
> debuild: fatal error at line 1350:
> dpkg-buildpackage -rfakeroot -d -us -uc -b -ai386 failed

I don't think -ai386 will work for e2fsprogs since we need external
libraries; specifically, libblkid-dev and libuuid-dev and the dev
packages aren't multiarch compatible.  (In fact, I'm not sure -ai386
to build 32-bit packages in an x86 environment will work in general.
It's certainly not the standard way 32-bit packages are built.)

The failures you're seeing are because "pkg-config --libs blkid" is
returning a null string when run in the dpkg-buildpackage -ai386
environment --- which is not surprising, since we don't have -32/-64
bit link libraries in libblkid-dev.
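
A quick way to check for this condition is to ask pkg-config directly and flag any package for which it reports nothing. The following is a minimal sketch; the helper names `pkg_config_libs` and `packages_without_libs` are hypothetical, and only the `pkg-config --libs <package>` invocation comes from the explanation above.

```python
import subprocess


def pkg_config_libs(package):
    """Return the flags printed by `pkg-config --libs <package>`, or "" on failure."""
    try:
        result = subprocess.run(
            ["pkg-config", "--libs", package],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # pkg-config itself is not installed in this environment.
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


def packages_without_libs(packages, query=pkg_config_libs):
    """List the packages for which the query returns an empty string."""
    return [name for name in packages if not query(name)]


print(packages_without_libs(["blkid", "uuid"]))
```

Run inside the failing build environment, a non-empty result would point at the missing link flags as the cause of the undefined `blkid_*` and `uuid_*` references.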

So if you want to build a 32-bit set of packages of e2fsprogs, you'll
need to make a 32-bit build environment as a chroot, using debootstrap
and build e2fsprogs in the 32-bit chroot.

That's actually how I build my packages for Debian, BTW --- I have a
32-bit chroot, and I build the binary packages for i386 and upload
them from there.  I let the autobuilders build the 64-bit binary
packages from the source upload.  That way, the most commonly used
binary packages are built in a standard autobuilder environment, and
are not subject to the vagaries of my build environment.  It also
means that I don't have to worry about the build scripts bitrotting for
the 32-bit packages.  (I also build 64-bit debs for my own use --- but
I don't upload them.)

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* metadata_csum Oops
  2012-08-06 23:25                 ` Theodore Ts'o
@ 2012-08-08 13:39                   ` George Spelvin
  2012-08-08 22:34                   ` Exciting :-( adventures in metadata checksumming George Spelvin
  1 sibling, 0 replies; 15+ messages in thread
From: George Spelvin @ 2012-08-08 13:39 UTC (permalink / raw)
  To: linux-ext4, tytso; +Cc: linux

The machine I reported issues with earlier has not glitched yet.
However, I have some more fun to report.

To further test metadata checksums, I enabled them on my desktop machine.
(It also has SSE4.2, so the checksum overhead should be minimal.)  This is
a Debian/unstable machine, with 64-bit kernel (v3.5 + ext4-for-linus)
and 32-bit userland, with e2fsprogs from the next branch.

This time I included the root FS, which gave some interesting issues
last night during the network backup run.  This morning I was greeted by:

BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [<ffffffff810ffbf9>] ext4_readdir+0x1e2/0x5a8
PGD 1589067 PUD 158a067 PMD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in:  battery nfds exportfs deflate zlib_deflate zlib_inflate ctr <whole bunch of crypto modules snipped> crypto_null af_key xfrm_algo fuse ftdi_sio usbserial r8199
Pid: 31650, comm: rsync Not tainted 3.5.0-00032-g0e1gf37 #55 Gigabyte Technology Co., Ltd. H55M-UD2H/H55M-UD2H
RIP: 0010:[<ffffffff810ffbf9>]  [<ffffffff810ffbf9>] ext4_readdir+0x1e2/0x5a8
RSP: 0018:ffff88010eb65e38  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800b6763300 RCX: 0000000000000000
RDX: ffffffff810e5d99 RSI: ffff88010eb65f40 RDI: ffff8800b6763300
RBP: ffff88010eb65ed8 R08: 0000000000013750 R09: ffffea000445bd40
R10: 0000000000000000 R11: ffffffff8111a578 R12: ffff880108808740
R13: ffff88003ab13748 R14: ffff8801137e8400 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880117c80000(0063) knlGS:00000000f760d6c0
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: fffffffffffffff8 CR3: 000000011177e000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rsync (pid: 31650, threadinfo ffff88010eb64000, task ffff880111d6a5e0)
Stack:
 ffff88010eb65f08 ffffffff810bc5e9 ffff880113782220 ffff880000000000
 00000005756e69e4 ffff88011379501e ffffffff810e5d99 ffff88010eb65f40 
 ffff88003ab13748 ffff88003ab13748 0000000000000000 ffffffff813dc01c
Call Trace:
 [<ffffffff810bc5e9>] ? do_filp_open+0x33/0x81
 [<ffffffff810e5d99>] ? compat_filldir+0xdd/0xdd
 [<ffffffff813dc01c>] ? _cond_resched+0x9/0x1d
 [<ffffffff8104437b>] ? should_resched+0x9/0x28
 [<ffffffff810e5d99>] ? compat_filldir+0xdd/0xdd
 [<ffffffff810be52e>] vfs_readdir+0x61/0x9a
 [<ffffffff810aa3b5>] ? kmem_cache_free+0x15/0x6e
 [<ffffffff810e729e>] compat_sys_getdents64+0x72/0xcc
 [<ffffffff813de59b>] sysenter_dispatch+0x7/0x1e
Code: 00 83 f8 00 0f 8c a6 00 00 00 75 02 eb 6d 4c 89 e7 e8 49 5b 09 00 49 89 44 24 08 49 8b 4c 24 08 48 8b 55 90 48 89 df 48 8b 75 98 <8b> 41 f8 41 89 44 24 20 8b 41 fc 48 83 e9 08 41 89 44 24 24 e8
RIP  [<ffffffff810ffbf9>] ext4_readdir+0x1e2/0x5a8
 RSP <ffff88010eb65e38>
CR2: fffffffffffffff8

(The above was hand-transcribed, so I *hope* I got all the hex correct!)

Anyway, on reboot, the system came up, but would not fsck, printing:

fsck.ext4: Superblock checksum does not match superblock while trying to open /dev/sda2
/dev/sda2:
The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

fsck died with exit code 8


So I logged in by hand and tried to run the recommended command, but!
Including "-b 8193" produced the above message, while *omitting* it
actually ran successfully.  I'm a little confused by that.
(The "fsck" wrapper invoked from /etc/init.d/checkroot.sh is 2.20.1-5.1.)
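
One possible explanation, offered as an assumption rather than something confirmed in this thread: block 8193 is the first backup superblock only on filesystems with 1 KiB blocks, while with 4 KiB blocks the first backup is at block 32768. The sketch below computes the expected backup locations, assuming the default of 8 * block_size blocks per group and the sparse_super layout (backups in group 1 and in groups that are powers of 3, 5, and 7). The function name is hypothetical.

```python
def backup_superblocks(block_size, total_blocks):
    """Return the block numbers of backup superblocks (sparse_super layout)."""
    blocks_per_group = 8 * block_size          # one bitmap block covers a group
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division to count the block groups.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)

    groups = {1}
    for base in (3, 5, 7):
        group = base
        while group < group_count:
            groups.add(group)
            group *= base

    return [
        group * blocks_per_group + first_data_block
        for group in sorted(groups)
        if group < group_count
    ]


# Block count taken from the e2fsck summary in this message; the 4 KiB
# block size is an assumption.
print(backup_superblocks(4096, 9765511))
```

If the block-size assumption holds, `e2fsck -b 32768` would be the command to try instead of `-b 8193`.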

For some limited values of "successfully"; it sure found a lot of problems:

(This is the second run; I stopped the first and started capturing it
when it was obvious pencil and paper was impractical.)

Script started on Wed Aug  8 08:40:45 2012
/run# e2fsck -y -v -C0 /dev/sda2
e2fsck 1.43-WIP (1-Aug-2012)
root was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
... some similar errors from earlier run omitted ...
Inode 947580 checksum does not match inode.  Clear? yes
Inode 947581 checksum does not match inode.  Clear? yes
Inode 947582 checksum does not match inode.  Clear? yes
Inode 947583 checksum does not match inode.  Clear? yes
Inode 947584 checksum does not match inode.  Clear? yes
Inode 947585 checksum does not match inode.  Clear? yes
Inode 947586 checksum does not match inode.  Clear? yes
Inode 947587 checksum does not match inode.  Clear? yes
Inode 947588 checksum does not match inode.  Clear? yes
Inode 947589 checksum does not match inode.  Clear? yes
Inode 947590 checksum does not match inode.  Clear? yes
Inode 947591 checksum does not match inode.  Clear? yes
... 315 additional deleted ...
Inode 947920 checksum does not match inode.  Clear? yes

Pass 2: Checking directory structure
Entry 'linux' in /usr/arm-linux-gnueabi/include (534036) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534570) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534582) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534660) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534681) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534695) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534737) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534785) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534796) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534805) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534820) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534824) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534860) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534911) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534947) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534959) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (534978) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (535012) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (535026) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (535045) has deleted/unused inode 534518.  Clear? yes
Entry '..' in ??? (535065) has deleted/unused inode 534518.  Clear? yes

Pass 3: Checking directory connectivity
Unconnected directory inode 534570 (...)
Connect to /lost+found? yes
... duplicates snipped ...

Pass 4: Checking reference counts
Inode 534036 ref count is 30, should be 29.  Fix? yes
Inode 534570 ref count is 3, should be 2.  Fix? yes
Inode 534582 ref count is 4, should be 3.  Fix? yes
Inode 534660 ref count is 3, should be 2.  Fix? yes
Inode 534681 ref count is 3, should be 2.  Fix? yes
Inode 534695 ref count is 3, should be 2.  Fix? yes
Inode 534737 ref count is 3, should be 2.  Fix? yes
Inode 534785 ref count is 3, should be 2.  Fix? yes
Unattached inode 534795  Connect to /lost+found? yes
Inode 534795 ref count is 2, should be 1.  Fix? yes
Inode 534796 ref count is 3, should be 2.  Fix? yes
Unattached inode 534797  Connect to /lost+found? yes
Inode 534797 ref count is 2, should be 1.  Fix? yes
... snip ...
Unattached inode 542215  Connect to /lost+found? yes
Inode 542215 ref count is 2, should be 1.  Fix? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(3833858--3833859) -3833862 -(3833864--3833865) -3833867 -3833870 -(3833872--3833873)  ...2.5 MB snipped...
Fix? yes

Free blocks count wrong for group #0 (27905, counted=27767).
Fix? yes
Free blocks count wrong for group #1 (2607, counted=2109).
Fix? yes
Free blocks count wrong for group #2 (4062, counted=4164).
Fix? yes
... snip ...
Free blocks count wrong for group #289 (21421, counted=21608).
Fix? yes
Free blocks count wrong (6043152, counted=6026661).
Fix? yes

Inode bitmap differences:  -(26241--26243) -(26245--26247) -26253 -26263 -26268 -(26271--26272) -(26275--26276) -26278 ... 1,000,000 bytes snipped ...
Fix? yes

Free inodes count wrong for group #0 (2, counted=31).
Fix? yes
Free inodes count wrong for group #1 (0, counted=934).
Fix? yes
Free inodes count wrong for group #2 (0, counted=1847).
Fix? yes
... snip ...
Free inodes count wrong for group #288 (486, counted=468).
Fix? yes

Free inodes count wrong (693942, counted=693943).
Fix? yes

root: ***** FILE SYSTEM WAS MODIFIED *****
root: ***** REBOOT LINUX *****

      286777 inodes used (29.24%, out of 980720)
         278 non-contiguous files (0.1%)
         172 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 262385/65
     3738850 blocks used (38.29%, out of 9765511)
           0 bad blocks
           0 large files

      238864 regular files
       21703 directories
         164 character device files
          10 block device files
           1 fifo
  4294967292 links
       26014 symbolic links (24132 fast symbolic links)
          12 sockets
------------
      286397 files

/run# e2fsck -f -v -C0 /dev/sda2
e2fsck 1.43-WIP (1-Aug-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

      286777 inodes used (29.24%, out of 980720)
         278 non-contiguous files (0.1%)
         173 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 262385/65
     3738850 blocks used (38.29%, out of 9765511)
           0 bad blocks
           0 large files

      238864 regular files
       21703 directories
         164 character device files
          10 block device files
           1 fifo
          36 links
       26014 symbolic links (24132 fast symbolic links)
          12 sockets
------------
      286804 files

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-06 23:25                 ` Theodore Ts'o
  2012-08-08 13:39                   ` metadata_csum Oops George Spelvin
@ 2012-08-08 22:34                   ` George Spelvin
  2012-08-08 23:42                     ` George Spelvin
  1 sibling, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-08 22:34 UTC (permalink / raw)
  To: linux-ext4, tytso; +Cc: linux

Yay!  It happened again!

[397113.459309] EXT4-fs error (device md0): ext4_iget:3821: inode #100691970: comm updatedb.mlocat: checksum invalid
[397113.459315] Aborting journal on device md0-8.
[397113.496222] EXT4-fs (md0): Remounting filesystem read-only

However, trying to preserve the evidence runs into a problem:

/root# e2image -Q /dev/md0 md0.qcow2
e2image 1.43-WIP (1-Aug-2012)
e2image: Inode checksum does not match inode while getting next inode
/root# ls -l md0.qcow2
-rw------- 1 root root 0 Aug  8 18:30 md0.qcow2

e2image aborts on the checksum error, preventing me from capturing a useful
image.

Can someone find a workaround QUICKLY?  I can't keep this FS read-only
for long.

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-08 22:34                   ` Exciting :-( adventures in metadata checksumming George Spelvin
@ 2012-08-08 23:42                     ` George Spelvin
  2012-08-09  5:00                       ` George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-08 23:42 UTC (permalink / raw)
  To: linux-ext4, tytso; +Cc: linux

> Can someone find a workaround QUICKLY?  I can't keep this FS read-only
> for long.

I thought I had figured out a great workaround: Use 1.42.4, which doesn't
know how to check checksums.

But then I discovered that it aborts and delivers a zero-length file
if there are filesystem inconsistencies, too!  So I get

e2image 1.42.4 (12-Jun-2012)
Illegal block number passed to ext2fs_mark_block_bitmap #3571066296 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2895243190 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #3276895043 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2488200263 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #2556839855 for in-use block map
... snip... (2671 total "Illegal block number passed" messages)
Illegal block number passed to ext2fs_mark_block_bitmap #3421917394 for in-use block map
Illegal block number passed to ext2fs_mark_block_bitmap #3469830505 for in-use block map
e2image: Illegal indirect block found while iterating over inode 85800474

I'm not sure this is The Right Thing To Do for a debugging tool.


The file system is a RAID-6, and repeated verifications have failed to find
RAID mismatches.

I am starting to suspect motherboard/RAM on this machine.  Already the bad
magic number error patterns looked odd to me, and I was just reminded that
we had to swap the RAM when it was first built so memtest86 would pass.
We ran it for many hours, but it *is* a consumer Intel box with no ECC.

And 8 GiB of RAM, and acting primarily as a file server, so FS metadata can
sit and bit-rot in RAM for a very long time.

I'm going to play with "hdparm -f" and drop_caches to see if I can make
the file system problems go away with no repair other than re-reading
from disk.

If so, that would confirm it as not ext4's problem.  Although it *would* be
a very cool debugging feature to re-check the checksum whenever a metadata
page is discarded from the buffer cache.

If the checksum matched when first read in, and doesn't when a supposedly
clean page is discarded, *something* is corrupting RAM.  (If you
assume that it's a single bit flip, then you can deduce the location
from the error syndrome.)
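
The single-bit-flip idea can be sketched as follows. This is a brute-force illustration, not the ext4 implementation: it checksums a plain buffer with CRC32C (the real inode checksum also mixes in other fields), and it simply tries every bit position instead of decoding the syndrome algebraically. The function names and the test buffer are hypothetical.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def locate_single_bit_flip(block: bytes, expected_crc: int):
    """Return the index of the one bit whose flip restores expected_crc, or None."""
    candidate = bytearray(block)
    for bit in range(len(block) * 8):
        mask = 1 << (bit % 8)
        candidate[bit // 8] ^= mask            # try flipping this bit
        if crc32c(bytes(candidate)) == expected_crc:
            return bit
        candidate[bit // 8] ^= mask            # undo and move on
    return None


# Simulate a block that was good when read and later had one bit corrupted.
original = bytes(range(64))
good_crc = crc32c(original)
damaged = bytearray(original)
damaged[10] ^= 0x04                            # flip bit 2 of byte 10

print(locate_single_bit_flip(bytes(damaged), good_crc))
```

The search is quadratic in the block size, which is tolerable for an inode-sized buffer; a real implementation would exploit the linearity of the CRC to map the syndrome to a position directly.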


Anyway, thanks for the help!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Exciting :-( adventures in metadata checksumming
  2012-08-08 23:42                     ` George Spelvin
@ 2012-08-09  5:00                       ` George Spelvin
  2012-08-09 23:48                         ` Arrgh! Even more excitement with " George Spelvin
  0 siblings, 1 reply; 15+ messages in thread
From: George Spelvin @ 2012-08-09  5:00 UTC (permalink / raw)
  To: linux-ext4, linux, tytso

I think indeed it was a RAM issue.  After some cache-flushing, e2fsck
on the file system with the checksum errors reported no errors, even
though there wasn't even a reboot in between.

(Then I discovered the "e2fsck -F" flag, which would have been
a lot simpler than what I did instead.  Oh, well.)
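
For reference, the kind of cache-flushing described above can be sketched like this. It is a minimal illustration of the `/proc/sys/vm/drop_caches` interface only (it does not cover `hdparm -f`), it needs root to work against the real path, and the function name and `path` parameter are hypothetical conveniences.

```python
import os


def drop_caches(mode=3, path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop clean caches.

    mode 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    os.sync()  # flush dirty data first; only clean pages can be dropped
    with open(path, "w") as f:
        f.write(f"{mode}\n")


# Usage (as root): drop_caches(3), then re-run the read-only checks so that
# metadata is re-read from disk instead of from possibly corrupted RAM.
```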

Looking in the BIOS, I found some suspicious timing parameters that
might be responsible, and reset them to stock.  I'll know for sure if
the problem stays away for a month, but in the mean time, thank you
everyone for your help and patience, and consider the matter
resolved unless I find more problems.

Sorry for the false alarm.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Arrgh!  Even more excitement with metadata checksumming
  2012-08-09  5:00                       ` George Spelvin
@ 2012-08-09 23:48                         ` George Spelvin
  0 siblings, 0 replies; 15+ messages in thread
From: George Spelvin @ 2012-08-09 23:48 UTC (permalink / raw)
  To: linux-ext4, linux, tytso

Having, I think, fixed the entire memory corruption problem that inspired
me to turn on metadata checksums, I want to revert to the standard Ubuntu
kernel, but first I need to figure out how to turn them off!

The obvious answer is with tune2fs:

> ~# tune2fs -O ^metadata_csum /dev/md0
> tune2fs 1.43-WIP (1-Aug-2012)
> rewrite_extents: Corrupt extent header while rewriting extents
> ~# 

Arrgh!  The file system passes e2fsck fine.  Well, *now* it doesn't; it
bitches about a zillion missing directory checksums and the lack
of lost+found:

> Directory inode 82676842, block #1, offset 3260: directory passes checks but fails checksum
> Fix? yes
> 
> Directory inode 82676842, block #2, offset 3264: directory passes checks but fails checksum
> Fix? yes
> 
> Directory inode 82677333, block #1, offset 3260: directory passes checks but fails checksum
> Fix? yes
> 
> Directory inode 82677333, block #2, offset 1328: directory passes checks but fails checksum
> Fix? yes
> 
> Directory inode 82675733, block #3, offset 124: directory passes checks but fails checksum
> Fix? yes
> 
> Directory inode 82676842, block #3, offset 432: directory passes checks but fails checksum
> Fix? yes
> 
> Pass 3: Checking directory connectivity
> Error while trying to find /lost+found: Directory block checksum does not match directory block
> /lost+found not found.  Create? yes
> 
> Error creating /lost+found directory (ext2fs_link): Directory block checksum does not match directory block
> Pass 3A: Optimizing directories
> Pass 4: Checking reference counts
> Unattached inode 12
> Connect to /lost+found? yes
> 
> Pass 5: Checking group summary information
> Block bitmap differences:  -937953738
> Fix? yes
> 
> Free blocks count wrong for group #28624 (28170, counted=28171).
> Fix? yes
> 
> Free blocks count wrong (841115975, counted=841115976).
> Fix? yes
> 
> 
> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****

But *after* that, it passes fine.

Unfortunately, I don't know where the problem is or I'd see if I could
just delete the damn problematic file.  (Or move it somewhere else
temporarily.)

I'm adding debugging to tune2fs and trying to track down the source.

Step 1:
eh_magic = dabe != f30a
Problem with extent of inode #85800449
rewrite_extents: Corrupt extent header while rewriting extents

Step 2:
debugfs:  ncheck 85800449
Inode   Pathname
debugfs:  testi <85800449>
Inode 85800449 is not in use

Step 3: ???
The print is in ext2fs_extent_open2, and I'm just printing the "ino"
parameter if ext2fs_extent_header_verify fails.

Why is tune2fs looking at an inode that's not in use?
That *would* explain the magic number error...
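
For context, the check that produced the "eh_magic = dabe != f30a" line can be sketched as below. This is not the e2fsprogs code; it only assumes the on-disk layout of the 12-byte ext4 extent header (four little-endian 16-bit fields followed by a 32-bit generation) and the magic value 0xF30A. The function names and sample headers are hypothetical.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
HEADER_FORMAT = "<HHHHI"  # eh_magic, eh_entries, eh_max, eh_depth, eh_generation


def parse_extent_header(raw: bytes) -> dict:
    """Decode the extent header stored at the start of an inode's i_block area."""
    magic, entries, max_entries, depth, generation = struct.unpack_from(HEADER_FORMAT, raw)
    return {
        "magic": magic,
        "entries": entries,
        "max": max_entries,
        "depth": depth,
        "generation": generation,
    }


def check_extent_header(raw: bytes):
    """Return None if the header looks sane, else a short description of the problem."""
    hdr = parse_extent_header(raw)
    if hdr["magic"] != EXT4_EXT_MAGIC:
        return f"eh_magic = {hdr['magic']:x} != {EXT4_EXT_MAGIC:x}"
    if hdr["entries"] > hdr["max"]:
        return f"eh_entries {hdr['entries']} > eh_max {hdr['max']}"
    return None


good = struct.pack(HEADER_FORMAT, 0xF30A, 1, 4, 0, 0)
stale = struct.pack(HEADER_FORMAT, 0xDABE, 0, 0, 0, 0)  # e.g. an unused inode

print(check_extent_header(good))
print(check_extent_header(stale))
```

An inode that is not in use has no reason to carry a valid header, so a check like this failing on it is expected; the open question above is why the inode was visited at all.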


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-08-09 23:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-03 19:55 Exciting :-( adventures in metadata checksumming George Spelvin
2012-08-03 23:49 ` Theodore Ts'o
2012-08-04  1:42   ` George Spelvin
2012-08-04 22:12     ` Theodore Ts'o
2012-08-04 22:41       ` George Spelvin
2012-08-06 16:47         ` Theodore Ts'o
2012-08-06 18:14           ` George Spelvin
2012-08-06 22:12             ` Theodore Ts'o
2012-08-06 22:59               ` George Spelvin
2012-08-06 23:25                 ` Theodore Ts'o
2012-08-08 13:39                   ` metadata_csum Oops George Spelvin
2012-08-08 22:34                   ` Exciting :-( adventures in metadata checksumming George Spelvin
2012-08-08 23:42                     ` George Spelvin
2012-08-09  5:00                       ` George Spelvin
2012-08-09 23:48                         ` Arrgh! Even more excitement with " George Spelvin
