* Ceph on btrfs 3.4rc
@ 2012-04-20 15:09 Christian Brunner
  2012-04-23  7:20   ` Christian Brunner
                   ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Christian Brunner @ 2012-04-20 15:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 4009 bytes --]

After running ceph on XFS for some time, I decided to try btrfs again.
Performance with the current "for-linux-min" branch and big metadata
is much better. The only problem (?) I'm still seeing is a warning
that seems to occur from time to time:

[87703.784552] ------------[ cut here ]------------
[87703.789759] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[87703.799070] Hardware name: ProLiant DL180 G6
[87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
3.3.2-1.fits.1.el6.x86_64 #1
[87703.837513] Call Trace:
[87703.840280]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
[87703.847016]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
[87703.853533]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[87703.861541]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
[87703.868674]  [<ffffffffa0351bcb>]
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[87703.876745]  [<ffffffff810127a3>] ? __switch_to+0x153/0x440
[87703.882966]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
[87703.888997]  [<ffffffffa0352040>] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[87703.897271]  [<ffffffffa035205f>] do_async_commit+0x1f/0x30 [btrfs]
[87703.904262]  [<ffffffff81068949>] process_one_work+0x129/0x450
[87703.910777]  [<ffffffff8106b7eb>] worker_thread+0x17b/0x3c0
[87703.916991]  [<ffffffff8106b670>] ? manage_workers+0x220/0x220
[87703.923504]  [<ffffffff810703fe>] kthread+0x9e/0xb0
[87703.928952]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
[87703.935555]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
[87703.943323]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
[87703.949149] ---[ end trace b8c31966cca731fa ]---
[91128.812399] ------------[ cut here ]------------
[91128.817576] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[91128.826930] Hardware name: ProLiant DL180 G6
[91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: P        W  O
3.3.2-1.fits.1.el6.x86_64 #1
[91128.865912] Call Trace:
[91128.868670]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
[91128.875379]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
[91128.881900]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[91128.889894]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
[91128.897019]  [<ffffffffa03a2b61>] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[91128.905075]  [<ffffffffa0351bcb>]
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[91128.913156]  [<ffffffffa03524b2>] ? start_transaction+0x92/0x310 [btrfs]
[91128.920643]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
[91128.926667]  [<ffffffffa034cfcb>] transaction_kthread+0x26b/0x2e0 [btrfs]
[91128.934254]  [<ffffffffa034cd60>] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.943671]  [<ffffffffa034cd60>] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.953079]  [<ffffffff810703fe>] kthread+0x9e/0xb0
[91128.958532]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
[91128.965133]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
[91128.972913]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
[91128.978826] ---[ end trace b8c31966cca731fb ]---

I'm able to reproduce this with ceph on a single server with 4 disks
(4 filesystems/osds) and a small test program based on librbd. It
simply writes random bytes to an rbd volume (see attachment).

Is this something I should care about? Any hints on solving this
would be appreciated.

Thanks,
Christian

[-- Attachment #2: rbdtest.c --]
[-- Type: text/x-csrc, Size: 1439 bytes --]

#include <inttypes.h>
#include <rbd/librbd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

/* Bound for the random write offset. Note: redefining RAND_MAX (as an
 * earlier version of this program did) does not change the range of
 * rand(), so the offset is reduced with a modulo instead. */
#define MAX_OFFSET 10485760

int nr_writes = 0;

void
alarm_handler(int sig) {
    fprintf(stderr, "Writes/sec: %i\n", nr_writes / 10);
    nr_writes = 0;
    alarm(10);
}

int main(int argc, char *argv[]) {
    rados_t cluster;
    rados_ioctx_t io_ctx;
    rbd_image_t image;
    char *pool = "rbd";
    char *imgname;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <image>\n", argv[0]);
        return 1;
    }
    imgname = argv[1];

    if (rados_create(&cluster, NULL) < 0) {
        fprintf(stderr, "error initializing\n");
        return 1;
    }

    rados_conf_read_file(cluster, NULL);

    if (rados_connect(cluster) < 0) {
        fprintf(stderr, "error connecting\n");
        rados_shutdown(cluster);
        return 1;
    }

    if (rados_ioctx_create(cluster, pool, &io_ctx) < 0) {
        fprintf(stderr, "error opening pool %s\n", pool);
        rados_shutdown(cluster);
        return 1;
    }

    if (rbd_open(io_ctx, imgname, &image, NULL) < 0) {
        fprintf(stderr, "error reading header from %s\n", imgname);
        rados_ioctx_destroy(io_ctx);
        rados_shutdown(cluster);
        return 1;
    }

    (void) signal(SIGALRM, alarm_handler);
    alarm(10);

    /* Write a single byte at a random offset, forever; the alarm
     * handler reports the write rate every 10 seconds. */
    while (1) {
        uint64_t start = (uint64_t) rand() % MAX_OFFSET;
        rbd_write(image, start, 1, "a");
        nr_writes++;
    }

    /* Not reached. */
    rbd_close(image);
    rados_ioctx_destroy(io_ctx);
    rados_shutdown(cluster);
    return 0;
}

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-20 15:09 Ceph on btrfs 3.4rc Christian Brunner
@ 2012-04-23  7:20   ` Christian Brunner
  2012-04-24 15:21 ` Josef Bacik
  2012-04-29 21:09 ` tsuna
  2 siblings, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-04-23  7:20 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ceph-devel

I decided to run the test over the weekend. The good news is that the
system is still running without performance degradation. But in the
meantime I've got over 5000 WARNINGs of this kind:

[330700.043557] btrfs: block rsv returned -28
[330700.043559] ------------[ cut here ]------------
[330700.048898] WARNING: at fs/btrfs/extent-tree.c:6220
btrfs_alloc_free_block+0x357/0x370 [btrfs]()
[330700.058880] Hardware name: ProLiant DL180 G6
[330700.064044] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[330700.090361] Pid: 7954, comm: btrfs-endio-wri Tainted: P        W
O 3.3.2-1.fits.1.el6.x86_64 #1
[330700.100393] Call Trace:
[330700.103263]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
[330700.110201]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
[330700.116905]  [<ffffffffa03436f7>] btrfs_alloc_free_block+0x357/0x370 [btrfs]
[330700.124988]  [<ffffffffa0330eb0>] ? __btrfs_cow_block+0x330/0x530 [btrfs]
[330700.132787]  [<ffffffffa0398174>] ?
btrfs_add_delayed_data_ref+0x64/0x1c0 [btrfs]
[330700.141369]  [<ffffffffa0372d8b>] ? read_extent_buffer+0xbb/0x120 [btrfs]
[330700.149194]  [<ffffffffa0365d6d>] ?
btrfs_token_item_offset+0x5d/0xe0 [btrfs]
[330700.157373]  [<ffffffffa0330cb3>] __btrfs_cow_block+0x133/0x530 [btrfs]
[330700.165023]  [<ffffffffa032f2ed>] ?
read_block_for_search+0x14d/0x3d0 [btrfs]
[330700.173183]  [<ffffffffa0331684>] btrfs_cow_block+0xf4/0x1f0 [btrfs]
[330700.180552]  [<ffffffffa03344b8>] btrfs_search_slot+0x3e8/0x8e0 [btrfs]
[330700.188128]  [<ffffffffa03469f4>] btrfs_lookup_csum+0x74/0x170 [btrfs]
[330700.195634]  [<ffffffff811589e5>] ? kmem_cache_alloc+0x105/0x130
[330700.202551]  [<ffffffffa03477e0>] btrfs_csum_file_blocks+0xd0/0x6d0 [btrfs]
[330700.210542]  [<ffffffffa03768b1>] ? clear_extent_bit+0x161/0x420 [btrfs]
[330700.218237]  [<ffffffffa0354109>] add_pending_csums+0x49/0x70 [btrfs]
[330700.225706]  [<ffffffffa0357de6>]
btrfs_finish_ordered_io+0x276/0x3d0 [btrfs]
[330700.233940]  [<ffffffffa0357f8c>]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[330700.242345]  [<ffffffffa0376cb9>] end_extent_writepage+0x69/0x100 [btrfs]
[330700.250192]  [<ffffffffa0376db6>] end_bio_extent_writepage+0x66/0xa0 [btrfs]
[330700.258327]  [<ffffffff8119959d>] bio_endio+0x1d/0x40
[330700.264214]  [<ffffffffa034b135>] end_workqueue_fn+0x45/0x50 [btrfs]
[330700.271612]  [<ffffffffa03831df>] worker_loop+0x14f/0x5a0 [btrfs]
[330700.278672]  [<ffffffffa0383090>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.286582]  [<ffffffffa0383090>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.294535]  [<ffffffff810703fe>] kthread+0x9e/0xb0
[330700.300244]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
[330700.307031]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
[330700.315061]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
[330700.321167] ---[ end trace b8c31966cca74ca0 ]---

The filesystems have plenty of free space:

/dev/sda              1.9T   16G  1.8T   1% /ceph/osd.000
/dev/sdb              1.9T   15G  1.8T   1% /ceph/osd.001
/dev/sdc              1.9T   13G  1.8T   1% /ceph/osd.002
/dev/sdd              1.9T   14G  1.8T   1% /ceph/osd.003

# btrfs fi df /ceph/osd.000
Data: total=38.01GB, used=15.53GB
System, DUP: total=8.00MB, used=64.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=37.50GB, used=82.19MB
Metadata: total=8.00MB, used=0.00

A few more btrfs_orphan_commit_root WARNINGs are present too. If
needed I could upload the messages file.

Regards,
Christian

On 20 April 2012 17:09, Christian Brunner <christian@brunner-muc.de> wrote:
> After running ceph on XFS for some time, I decided to try btrfs again.
> Performance with the current "for-linux-min" branch and big metadata
> is much better. The only problem (?) I'm still seeing is a warning
> that seems to occur from time to time:
>
> [87703.784552] ------------[ cut here ]------------
> [87703.789759] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [87703.799070] Hardware name: ProLiant DL180 G6
> [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [87703.837513] Call Trace:
> [87703.840280]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> [87703.847016]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> [87703.853533]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> [87703.861541]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [87703.868674]  [<ffffffffa0351bcb>]
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [87703.876745]  [<ffffffff810127a3>] ? __switch_to+0x153/0x440
> [87703.882966]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> [87703.888997]  [<ffffffffa0352040>] ?
> btrfs_commit_transaction+0xa50/0xa50 [btrfs]
> [87703.897271]  [<ffffffffa035205f>] do_async_commit+0x1f/0x30 [btrfs]
> [87703.904262]  [<ffffffff81068949>] process_one_work+0x129/0x450
> [87703.910777]  [<ffffffff8106b7eb>] worker_thread+0x17b/0x3c0
> [87703.916991]  [<ffffffff8106b670>] ? manage_workers+0x220/0x220
> [87703.923504]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> [87703.928952]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> [87703.935555]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> [87703.943323]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> [87703.949149] ---[ end trace b8c31966cca731fa ]---
> [91128.812399] ------------[ cut here ]------------
> [91128.817576] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [91128.826930] Hardware name: ProLiant DL180 G6
> [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: P        W  O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [91128.865912] Call Trace:
> [91128.868670]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> [91128.875379]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> [91128.881900]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> [91128.889894]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [91128.897019]  [<ffffffffa03a2b61>] ?
> btrfs_run_delayed_items+0xf1/0x160 [btrfs]
> [91128.905075]  [<ffffffffa0351bcb>]
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [91128.913156]  [<ffffffffa03524b2>] ? start_transaction+0x92/0x310 [btrfs]
> [91128.920643]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> [91128.926667]  [<ffffffffa034cfcb>] transaction_kthread+0x26b/0x2e0 [btrfs]
> [91128.934254]  [<ffffffffa034cd60>] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.943671]  [<ffffffffa034cd60>] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.953079]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> [91128.958532]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> [91128.965133]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> [91128.972913]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> [91128.978826] ---[ end trace b8c31966cca731fb ]---
>
> I'm able to reproduce this with ceph on a single server with 4 disks
> (4 filesystems/osds) and a small test program based on librbd. It
> simply writes random bytes to an rbd volume (see attachment).
>
> Is this something I should care about? Any hints on solving this
> would be appreciated.
>
> Thanks,
> Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph on btrfs 3.4rc
  2012-04-20 15:09 Ceph on btrfs 3.4rc Christian Brunner
  2012-04-23  7:20   ` Christian Brunner
@ 2012-04-24 15:21 ` Josef Bacik
  2012-04-24 16:26   ` Sage Weil
  2012-04-29 21:09 ` tsuna
  2 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-04-24 15:21 UTC (permalink / raw)
  To: Christian Brunner; +Cc: linux-btrfs, ceph-devel

On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> After running ceph on XFS for some time, I decided to try btrfs again.
> Performance with the current "for-linux-min" branch and big metadata
> is much better. The only problem (?) I'm still seeing is a warning
> that seems to occur from time to time:
> 
> [87703.784552] ------------[ cut here ]------------
> [87703.789759] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [87703.799070] Hardware name: ProLiant DL180 G6
> [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [87703.837513] Call Trace:
> [87703.840280]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> [87703.847016]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> [87703.853533]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> [87703.861541]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [87703.868674]  [<ffffffffa0351bcb>]
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [87703.876745]  [<ffffffff810127a3>] ? __switch_to+0x153/0x440
> [87703.882966]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> [87703.888997]  [<ffffffffa0352040>] ?
> btrfs_commit_transaction+0xa50/0xa50 [btrfs]
> [87703.897271]  [<ffffffffa035205f>] do_async_commit+0x1f/0x30 [btrfs]
> [87703.904262]  [<ffffffff81068949>] process_one_work+0x129/0x450
> [87703.910777]  [<ffffffff8106b7eb>] worker_thread+0x17b/0x3c0
> [87703.916991]  [<ffffffff8106b670>] ? manage_workers+0x220/0x220
> [87703.923504]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> [87703.928952]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> [87703.935555]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> [87703.943323]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> [87703.949149] ---[ end trace b8c31966cca731fa ]---
> [91128.812399] ------------[ cut here ]------------
> [91128.817576] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [91128.826930] Hardware name: ProLiant DL180 G6
> [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: P        W  O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [91128.865912] Call Trace:
> [91128.868670]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> [91128.875379]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> [91128.881900]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> [91128.889894]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [91128.897019]  [<ffffffffa03a2b61>] ?
> btrfs_run_delayed_items+0xf1/0x160 [btrfs]
> [91128.905075]  [<ffffffffa0351bcb>]
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [91128.913156]  [<ffffffffa03524b2>] ? start_transaction+0x92/0x310 [btrfs]
> [91128.920643]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> [91128.926667]  [<ffffffffa034cfcb>] transaction_kthread+0x26b/0x2e0 [btrfs]
> [91128.934254]  [<ffffffffa034cd60>] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.943671]  [<ffffffffa034cd60>] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.953079]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> [91128.958532]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> [91128.965133]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> [91128.972913]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> [91128.978826] ---[ end trace b8c31966cca731fb ]---
> 
> I'm able to reproduce this with ceph on a single server with 4 disks
> (4 filesystems/osds) and a small test program based on librbd. It
> simply writes random bytes to an rbd volume (see attachment).
> 
> Is this something I should care about? Any hints on solving this
> would be appreciated.
> 

Can you send me a config or some basic steps to set up ceph on my box so I
can run this program and finally track down this problem?  Thanks,

Josef


* Re: Ceph on btrfs 3.4rc
  2012-04-24 15:21 ` Josef Bacik
@ 2012-04-24 16:26   ` Sage Weil
  2012-04-24 17:33     ` Josef Bacik
                       ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Sage Weil @ 2012-04-24 16:26 UTC (permalink / raw)
  To: Christian Brunner, Josef Bacik; +Cc: linux-btrfs, ceph-devel

On Tue, 24 Apr 2012, Josef Bacik wrote:
> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > After running ceph on XFS for some time, I decided to try btrfs again.
> > Performance with the current "for-linux-min" branch and big metadata
> > is much better. The only problem (?) I'm still seeing is a warning
> > that seems to occur from time to time:

Actually, before you do that... we have a new tool, 
test_filestore_workloadgen, that generates a ceph-osd-like workload on the 
local file system.  It's a subset of what a full OSD might do, but if 
we're lucky it will be sufficient to reproduce this issue.  Something like

 test_filestore_workloadgen --osd-data /foo --osd-journal /bar

will hopefully do the trick.

Christian, maybe you can see if that is able to trigger this warning?  
You'll need to pull it from the current master branch; it wasn't in the 
last release.

Thanks!
sage


> > 
> > [87703.784552] ------------[ cut here ]------------
> > [87703.789759] WARNING: at fs/btrfs/inode.c:2103
> > btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> > [87703.799070] Hardware name: ProLiant DL180 G6
> > [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> > exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> > iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> > iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> > [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
> > 3.3.2-1.fits.1.el6.x86_64 #1
> > [87703.837513] Call Trace:
> > [87703.840280]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> > [87703.847016]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> > [87703.853533]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> > [87703.861541]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> > [87703.868674]  [<ffffffffa0351bcb>]
> > btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> > [87703.876745]  [<ffffffff810127a3>] ? __switch_to+0x153/0x440
> > [87703.882966]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> > [87703.888997]  [<ffffffffa0352040>] ?
> > btrfs_commit_transaction+0xa50/0xa50 [btrfs]
> > [87703.897271]  [<ffffffffa035205f>] do_async_commit+0x1f/0x30 [btrfs]
> > [87703.904262]  [<ffffffff81068949>] process_one_work+0x129/0x450
> > [87703.910777]  [<ffffffff8106b7eb>] worker_thread+0x17b/0x3c0
> > [87703.916991]  [<ffffffff8106b670>] ? manage_workers+0x220/0x220
> > [87703.923504]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> > [87703.928952]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> > [87703.935555]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> > [87703.943323]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> > [87703.949149] ---[ end trace b8c31966cca731fa ]---
> > [91128.812399] ------------[ cut here ]------------
> > [91128.817576] WARNING: at fs/btrfs/inode.c:2103
> > btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> > [91128.826930] Hardware name: ProLiant DL180 G6
> > [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> > exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> > iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> > iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> > [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: P        W  O
> > 3.3.2-1.fits.1.el6.x86_64 #1
> > [91128.865912] Call Trace:
> > [91128.868670]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
> > [91128.875379]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
> > [91128.881900]  [<ffffffffa0355686>] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
> > [91128.889894]  [<ffffffffa0350a06>] commit_fs_roots+0xc6/0x1c0 [btrfs]
> > [91128.897019]  [<ffffffffa03a2b61>] ?
> > btrfs_run_delayed_items+0xf1/0x160 [btrfs]
> > [91128.905075]  [<ffffffffa0351bcb>]
> > btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> > [91128.913156]  [<ffffffffa03524b2>] ? start_transaction+0x92/0x310 [btrfs]
> > [91128.920643]  [<ffffffff81070a90>] ? wake_up_bit+0x40/0x40
> > [91128.926667]  [<ffffffffa034cfcb>] transaction_kthread+0x26b/0x2e0 [btrfs]
> > [91128.934254]  [<ffffffffa034cd60>] ?
> > btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> > [91128.943671]  [<ffffffffa034cd60>] ?
> > btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> > [91128.953079]  [<ffffffff810703fe>] kthread+0x9e/0xb0
> > [91128.958532]  [<ffffffff8158c224>] kernel_thread_helper+0x4/0x10
> > [91128.965133]  [<ffffffff81070360>] ? kthread_freezable_should_stop+0x70/0x70
> > [91128.972913]  [<ffffffff8158c220>] ? gs_change+0x13/0x13
> > [91128.978826] ---[ end trace b8c31966cca731fb ]---
> > 
> > I'm able to reproduce this with ceph on a single server with 4 disks
> > (4 filesystems/osds) and a small test program based on librbd. It is
> > simply writing random bytes on a rbd volume (see attachment).
> > 
> > Is this something I should care about? Any hints on solving this
> > would be appreciated.
> > 
> 
> Can you send me a config or some basic steps for me to set up ceph on my box so I
> can run this program and finally track down this problem?  Thanks,
> 
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-24 16:26   ` Sage Weil
@ 2012-04-24 17:33     ` Josef Bacik
  2012-04-24 17:41       ` Neil Horman
  2012-04-25 11:28     ` Christian Brunner
  2012-04-27 11:02     ` Christian Brunner
  2 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-04-24 17:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: Christian Brunner, Josef Bacik, linux-btrfs, ceph-devel

On Tue, Apr 24, 2012 at 09:26:15AM -0700, Sage Weil wrote:
> On Tue, 24 Apr 2012, Josef Bacik wrote:
> > On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > > After running ceph on XFS for some time, I decided to try btrfs again.
> > > Performance with the current "for-linux-min" branch and big metadata
> > > is much better. The only problem (?) I'm still seeing is a warning
> > > that seems to occur from time to time:
> 
> Actually, before you do that... we have a new tool, 
> test_filestore_workloadgen, that generates a ceph-osd-like workload on the 
> local file system.  It's a subset of what a full OSD might do, but if 
> we're lucky it will be sufficient to reproduce this issue.  Something like
> 
>  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> 
> will hopefully do the trick.
> 
> Christian, maybe you can see if that is able to trigger this warning?  
> You'll need to pull it from the current master branch; it wasn't in the 
> last release.
> 

Keep up the good work Sage, at this rate I'll never have to set up ceph for
myself :),

Josef

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-24 17:33     ` Josef Bacik
@ 2012-04-24 17:41       ` Neil Horman
  0 siblings, 0 replies; 66+ messages in thread
From: Neil Horman @ 2012-04-24 17:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, Christian Brunner, linux-btrfs, ceph-devel

On Tue, Apr 24, 2012 at 01:33:44PM -0400, Josef Bacik wrote:
> On Tue, Apr 24, 2012 at 09:26:15AM -0700, Sage Weil wrote:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> > > On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > > > After running ceph on XFS for some time, I decided to try btrfs again.
> > > > Performance with the current "for-linux-min" branch and big metadata
> > > > is much better. The only problem (?) I'm still seeing is a warning
> > > > that seems to occur from time to time:
> > 
> > Actually, before you do that... we have a new tool, 
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the 
> > local file system.  It's a subset of what a full OSD might do, but if 
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> > 
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> > 
> > will hopefully do the trick.
> > 
> > Christian, maybe you can see if that is able to trigger this warning?  
> > You'll need to pull it from the current master branch; it wasn't in the 
> > last release.
> > 
> 
> Keep up the good work Sage, at this rate I'll never have to set up ceph for
> myself :),
> 
You can set up another OSD on daedalus if you're looking for something to do
Josef :)
Neil

> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-24 16:26   ` Sage Weil
  2012-04-24 17:33     ` Josef Bacik
@ 2012-04-25 11:28     ` Christian Brunner
  2012-04-25 12:16       ` João Eduardo Luís
  2012-04-27 11:02     ` Christian Brunner
  2 siblings, 1 reply; 66+ messages in thread
From: Christian Brunner @ 2012-04-25 11:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josef Bacik, ceph-devel

Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
> On Tue, 24 Apr 2012, Josef Bacik wrote:
>> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> > After running ceph on XFS for some time, I decided to try btrfs again.
>> > Performance with the current "for-linux-min" branch and big metadata
>> > is much better. The only problem (?) I'm still seeing is a warning
>> > that seems to occur from time to time:
>
> Actually, before you do that... we have a new tool,
> test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> local file system.  It's a subset of what a full OSD might do, but if
> we're lucky it will be sufficient to reproduce this issue.  Something like
>
>  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>
> will hopefully do the trick.
>
> Christian, maybe you can see if that is able to trigger this warning?
> You'll need to pull it from the current master branch; it wasn't in the
> last release.

I've tried test_filestore_workloadgen, but it didn't work very well.
After 5 minutes it terminated with the following messages:

2012-04-25 11:28:57.768709 7fcac3f69760  0 Destroying collection
'0.1_head' (358 objects)
2012-04-25 11:29:07.478747 7fcac3f69760  0 Destroying collection
'0.22_head' (477 objects)
2012-04-25 11:29:07.479035 7fcac3f69760  0 We ran out of collections!
2012-04-25 11:29:07.916149 7fcac3f69760  1 journal close
/dev/mapper/vg01-lv_osd_journal_0

I guess this is not btrfs related :-)

@Josef: If Sage doesn't have any other solution, I'll put up some ceph
installation instructions.

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-25 11:28     ` Christian Brunner
@ 2012-04-25 12:16       ` João Eduardo Luís
  0 siblings, 0 replies; 66+ messages in thread
From: João Eduardo Luís @ 2012-04-25 12:16 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Sage Weil, Josef Bacik, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1056 bytes --]

On 04/25/2012 12:28 PM, Christian Brunner wrote:
> I've tried test_filestore_workloadgen, but it didn't work very well.
> After 5 minutes it terminated with the following messages:
> 
> 2012-04-25 11:28:57.768709 7fcac3f69760  0 Destroying collection
> '0.1_head' (358 objects)
> 2012-04-25 11:29:07.478747 7fcac3f69760  0 Destroying collection
> '0.22_head' (477 objects)
> 2012-04-25 11:29:07.479035 7fcac3f69760  0 We ran out of collections!
> 2012-04-25 11:29:07.916149 7fcac3f69760  1 journal close
> /dev/mapper/vg01-lv_osd_journal_0
> 
> I guess this is not btrfs related :-)
> 

Oh yeah... the "solution" for that is to set '--test-num-colls' to a
greater value (the test creates 30 by default, and destroys them every
'X' transactions). Increasing the number of collections will give you a
longer run.

But since that sounds like a workaround for something that could easily
be avoided, I'll make a patch for the test to keep that from happening.
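For reference, a longer run along these lines would look something like the
following (the flag name is as João gives it above; the data and journal
paths are the placeholders from Sage's earlier message, and the collection
count is an arbitrary example):

```
test_filestore_workloadgen --osd-data /foo --osd-journal /bar --test-num-colls 200
```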

-- 
João Eduardo Luís
gpg key: 477C26E5 from pool.keyserver.eu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 554 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-24 16:26   ` Sage Weil
  2012-04-24 17:33     ` Josef Bacik
  2012-04-25 11:28     ` Christian Brunner
@ 2012-04-27 11:02     ` Christian Brunner
  2012-05-03 14:13         ` Josef Bacik
                         ` (2 more replies)
  2 siblings, 3 replies; 66+ messages in thread
From: Christian Brunner @ 2012-04-27 11:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josef Bacik, linux-btrfs, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2734 bytes --]

Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
> On Tue, 24 Apr 2012, Josef Bacik wrote:
>> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> > After running ceph on XFS for some time, I decided to try btrfs again.
>> > Performance with the current "for-linux-min" branch and big metadata
>> > is much better. The only problem (?) I'm still seeing is a warning
>> > that seems to occur from time to time:
>
> Actually, before you do that... we have a new tool,
> test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> local file system.  It's a subset of what a full OSD might do, but if
> we're lucky it will be sufficient to reproduce this issue.  Something like
>
>  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>
> will hopefully do the trick.
>
> Christian, maybe you can see if that is able to trigger this warning?
> You'll need to pull it from the current master branch; it wasn't in the
> last release.

Trying to reproduce with test_filestore_workloadgen didn't work for
me. So here are some instructions on how to reproduce with a minimal
ceph setup.

You will need a single system with two disks and a bit of memory.

- Compile and install ceph (detailed instructions:
http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)

- For the test setup I've used two tmpfs files as journal devices. To
create these, do the following:

# mkdir -p /ceph/temp
# mount -t tmpfs tmpfs /ceph/temp
# dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
# dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k

- Now you should create and mount btrfs. Here is what I did:

# mkfs.btrfs -l 64k -n 64k /dev/sda
# mkfs.btrfs -l 64k -n 64k /dev/sdb
# mkdir /ceph/osd.000
# mkdir /ceph/osd.001
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001

- Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
will probably have to change the btrfs devices and the hostname
(os39).

- Create the ceph filesystems:

# mkdir /ceph/mon
# mkcephfs -a -c /etc/ceph/ceph.conf

- Start ceph (e.g. "service ceph start")

- Now you should be able to use ceph - "ceph -s" will tell you about
the state of the ceph cluster.

- "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.

- Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
with "./rbdtest testimg".
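The rbdtest.c attachment itself is not preserved in this archive. As a
rough, hypothetical stand-in for the workload it describes (random bytes
written at random offsets across the 100 MB image), the sketch below
reproduces the same access pattern against a plain sparse file; it does not
use librbd, and all names and sizes in it are assumptions rather than the
contents of the missing attachment:

```python
# Hypothetical stand-in for the rbdtest workload described above: it
# writes random-sized blocks of random bytes at random offsets (the
# access pattern Christian describes), but against an ordinary sparse
# file instead of an rbd image, so no librbd is involved.
import os
import random
import tempfile

IMAGE_SIZE = 100 * 1024 * 1024  # mirrors "rbd create -size 100" (100 MB)

def random_write_pass(path, writes=64, max_len=4 * 1024 * 1024):
    """Perform one pass of random writes, then flush them to disk."""
    with open(path, "r+b") as f:
        for _ in range(writes):
            length = random.randint(1, max_len)
            offset = random.randrange(0, IMAGE_SIZE - length)
            f.seek(offset)
            f.write(os.urandom(length))
        f.flush()
        os.fsync(f.fileno())

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.truncate(IMAGE_SIZE)  # sparse 100 MB "image"
        name = tmp.name
    random_write_pass(name)
    # All writes stay inside the image bounds, so the size is unchanged.
    assert os.path.getsize(name) == IMAGE_SIZE
    os.unlink(name)
```

Run in a loop against a file on the btrfs mount, this only approximates the
sustained random-write load; the real test drives librbd against the
testimg volume created above.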

I can see the first btrfs_orphan_commit_root warning after an hour or
so... I hope that I've described all necessary steps. If there is a
problem just send me a note.

Thanks,
Christian

[-- Attachment #2: ceph.conf --]
[-- Type: application/octet-stream, Size: 610 bytes --]

[global]
        auth supported = none
        debug optracker = 0 ; to work around a log problem in ceph 0.45

[mon]
        mon data = /ceph/mon

[mon.0]
        host = os39
        mon addr = 127.0.0.1:6789

[osd]
        osd data = /ceph/osd.$id
        osd class dir = /usr/lib64/rados-classes

[osd.000]
        host = os39
        osd journal = /ceph/temp/journal0
        osd journal size = 500M
        journal dio = False
        btrfs devs = "/dev/sda"

[osd.001]
        host = os39
        osd journal = /ceph/temp/journal1
        osd journal size = 500M
        journal dio = False
        btrfs devs = "/dev/sdb"

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-20 15:09 Ceph on btrfs 3.4rc Christian Brunner
  2012-04-23  7:20   ` Christian Brunner
  2012-04-24 15:21 ` Josef Bacik
@ 2012-04-29 21:09 ` tsuna
  2012-04-30 10:28     ` Christian Brunner
  2 siblings, 1 reply; 66+ messages in thread
From: tsuna @ 2012-04-29 21:09 UTC (permalink / raw)
  To: Christian Brunner; +Cc: linux-btrfs, ceph-devel

On Fri, Apr 20, 2012 at 8:09 AM, Christian Brunner
<christian@brunner-muc.de> wrote:
> After running ceph on XFS for some time, I decided to try btrfs again.
> Performance with the current "for-linux-min" branch and big metadata
> is much better.

I've heard that although performance from btrfs is better at first, it
degrades over time due to metadata fragmentation, whereas XFS'
performance starts off a little worse, but remains stable even after
weeks of heavy utilization.  Would be curious to hear your (or
others') feedback on that topic.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-29 21:09 ` tsuna
@ 2012-04-30 10:28     ` Christian Brunner
  0 siblings, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-04-30 10:28 UTC (permalink / raw)
  To: tsuna; +Cc: linux-btrfs, ceph-devel

2012/4/29 tsuna <tsunanet@gmail.com>:
> On Fri, Apr 20, 2012 at 8:09 AM, Christian Brunner
> <christian@brunner-muc.de> wrote:
>> After running ceph on XFS for some time, I decided to try btrfs again.
>> Performance with the current "for-linux-min" branch and big metadata
>> is much better.
>
> I've heard that although performance from btrfs is better at first, it
> degrades over time due to metadata fragmentation, whereas XFS'
> performance starts off a little worse, but remains stable even after
> weeks of heavy utilization.  Would be curious to hear your (or
> others') feedback on that topic.

Metadata fragmentation was a big problem (for us) in the past. With
the "big metadata feature" (mkfs.btrfs -l 64k -n 64k) these problems
seem to be solved. We do not use it in production yet, but my stress
test didn't show any degradation. The only remaining issues I've seen
are these warnings.

Regards,
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-30 10:28     ` Christian Brunner
  (?)
@ 2012-04-30 10:54     ` Amon Ott
  -1 siblings, 0 replies; 66+ messages in thread
From: Amon Ott @ 2012-04-30 10:54 UTC (permalink / raw)
  To: ceph-devel

On Monday 30 April 2012 wrote Christian Brunner:
> Metadata fragmentation was a big problem (for us) in the past. With
> the "big metadata feature" (mkfs.btrfs -l 64k -n 64k) these problems
> seem to be solved. We do not use it in production yet, but my stress
> test didn't show any degradation. The only remaining issues I've seen
> are these warnings.

Where exactly is the metadata stored, in the leaf (-l 64k) or in the node (-n
64k)? I would like to avoid wasting space and disk bandwidth and only use one
of these options, if that makes sense.

After reading the short man page I would have tried -n.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-27 11:02     ` Christian Brunner
@ 2012-05-03 14:13         ` Josef Bacik
  2012-05-10 17:40         ` Josef Bacik
  2012-05-10 20:35         ` Josef Bacik
  2 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-03 14:13 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Sage Weil, Josef Bacik, linux-btrfs, ceph-devel

On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
> 
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
> 
> You will need a single system with two disks and a bit of memory.
> 
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> 
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
> 
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> 
> - Now you should create and mount btrfs. Here is what I did:
> 
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> 
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
> 
> - Create the ceph filesystems:
> 
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
> 
> - Start ceph (e.g. "service ceph start")
> 
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
> 
> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> 

It's failing here

http://fpaste.org/e3BG/

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-03 14:13         ` Josef Bacik
@ 2012-05-03 15:17           ` Josh Durgin
  -1 siblings, 0 replies; 66+ messages in thread
From: Josh Durgin @ 2012-05-03 15:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik <josef@redhat.com>
wrote:
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
>> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
>> > On Tue, 24 Apr 2012, Josef Bacik wrote:
>> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> >> > After running ceph on XFS for some time, I decided to try btrfs again.
>> >> > Performance with the current "for-linux-min" branch and big metadata
>> >> > is much better. The only problem (?) I'm still seeing is a warning
>> >> > that seems to occur from time to time:
>> >
>> > Actually, before you do that... we have a new tool,
>> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
>> > local file system.  It's a subset of what a full OSD might do, but if
>> > we're lucky it will be sufficient to reproduce this issue.  Something like
>> >
>> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>> >
>> > will hopefully do the trick.
>> >
>> > Christian, maybe you can see if that is able to trigger this warning?
>> > You'll need to pull it from the current master branch; it wasn't in the
>> > last release.
>>
>> Trying to reproduce with test_filestore_workloadgen didn't work for
>> me. So here are some instructions on how to reproduce with a minimal
>> ceph setup.
>>
>> You will need a single system with two disks and a bit of memory.
>>
>> - Compile and install ceph (detailed instructions:
>> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>>
>> - For the test setup I've used two tmpfs files as journal devices. To
>> create these, do the following:
>>
>> # mkdir -p /ceph/temp
>> # mount -t tmpfs tmpfs /ceph/temp
>> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
>> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>>
>> - Now you should create and mount btrfs. Here is what I did:
>>
>> # mkfs.btrfs -l 64k -n 64k /dev/sda
>> # mkfs.btrfs -l 64k -n 64k /dev/sdb
>> # mkdir /ceph/osd.000
>> # mkdir /ceph/osd.001
>> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
>> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>>
>> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
>> will probably have to change the btrfs devices and the hostname
>> (os39).
>>
>> - Create the ceph filesystems:
>>
>> # mkdir /ceph/mon
>> # mkcephfs -a -c /etc/ceph/ceph.conf
>>
>> - Start ceph (e.g. "service ceph start")
>>
>> - Now you should be able to use ceph - "ceph -s" will tell you about
>> the state of the ceph cluster.
>>
>> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
>>
> 
> It's failing here
> 
> http://fpaste.org/e3BG/

2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
osd.1 127.0.0.1:6803/2379 3 ==== osd_op_reply(3 rbd_info [call] = -5
(Input/output error)) v4 ==== 107+0+0 (3948821281 0 0) 0x7fcb380009a0
con 0x1cad3e0

This is probably because the osd isn't finding the rbd class.
Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
rbd_cls.so is, try adding 'osd class dir = /path/to/rados-classes'
to the [osd] section in your ceph.conf, and restarting the osds.

If you set 'debug osd = 10' you should see '_load_class rbd' in the
osd log when you try to create an rbd image.

Autotools should be setting the default location correctly, but if
you're running the osds in a chroot or something the path would be
wrong.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
@ 2012-05-03 15:17           ` Josh Durgin
  0 siblings, 0 replies; 66+ messages in thread
From: Josh Durgin @ 2012-05-03 15:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik <josef@redhat.com>
wrote:
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
>> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
>> > On Tue, 24 Apr 2012, Josef Bacik wrote:
>> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> >> > After running ceph on XFS for some time, I decided to try btrfs again.
>> >> > Performance with the current "for-linux-min" branch and big metadata
>> >> > is much better. The only problem (?) I'm still seeing is a warning
>> >> > that seems to occur from time to time:
>> >
>> > Actually, before you do that... we have a new tool,
>> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
>> > local file system.  It's a subset of what a full OSD might do, but if
>> > we're lucky it will be sufficient to reproduce this issue.  Something like
>> >
>> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>> >
>> > will hopefully do the trick.
>> >
>> > Christian, maybe you can see if that is able to trigger this warning?
>> > You'll need to pull it from the current master branch; it wasn't in the
>> > last release.
>>
>> Trying to reproduce with test_filestore_workloadgen didn't work for
>> me. So here are some instructions on how to reproduce with a minimal
>> ceph setup.
>>
>> You will need a single system with two disks and a bit of memory.
>>
>> - Compile and install ceph (detailed instructions:
>> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>>
>> - For the test setup I've used two tmpfs files as journal devices. To
>> create these, do the following:
>>
>> # mkdir -p /ceph/temp
>> # mount -t tmpfs tmpfs /ceph/temp
>> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
>> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>>
>> - Now you should create and mount btrfs. Here is what I did:
>>
>> # mkfs.btrfs -l 64k -n 64k /dev/sda
>> # mkfs.btrfs -l 64k -n 64k /dev/sdb
>> # mkdir /ceph/osd.000
>> # mkdir /ceph/osd.001
>> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
>> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>>
>> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
>> will probably have to change the btrfs devices and the hostname
>> (os39).
>>
>> - Create the ceph filesystems:
>>
>> # mkdir /ceph/mon
>> # mkcephfs -a -c /etc/ceph/ceph.conf
>>
>> - Start ceph (e.g. "service ceph start")
>>
>> - Now you should be able to use ceph - "ceph -s" will tell you about
>> the state of the ceph cluster.
>>
>> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
>>
> 
> It's failing here
> 
> http://fpaste.org/e3BG/

2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
osd.1 127.0.0.1:6803/2379 3 ==== osd_op_reply(3 rbd_info [call] = -5
(Input/output error)) v4 ==== 107+0+0 (3948821281 0 0) 0x7fcb380009a0
con 0x1cad3e0

This is probably because the osd isn't finding the rbd class.
Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
rbd_cls.so is,
try adding 'osd class dir = /path/to/rados-classes' to the [osd]
section
in your ceph.conf, and restarting the osds.

If you set 'debug osd = 10' you should see '_load_class rbd' in the osd
log
when you try to create an rbd image.

Autotools should be setting the default location correctly, but if
you're
running the osds in a chroot or something the path would be wrong.
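
Concretely, the [osd] section Josh describes would look roughly like this. The rados-classes path shown is the usual default and is an assumption; verify it against wherever rbd_cls.so actually lives on your system:

```ini
[osd]
        ; assumption: point this at the directory containing rbd_cls.so
        osd class dir = /usr/lib64/rados-classes
        ; verbose enough to show '_load_class rbd' in the osd log
        debug osd = 10
```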

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: Ceph on btrfs 3.4rc
@ 2012-05-03 15:20             ` Josef Bacik
  0 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-03 15:20 UTC (permalink / raw)
  To: Josh Durgin
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik <josef@redhat.com>
> wrote:
> > On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> >> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
> >> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> >> > Performance with the current "for-linux-min" branch and big metadata
> >> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> >> > that seems to occur from time to time:
> >> >
> >> > Actually, before you do that... we have a new tool,
> >> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> >> > local file system.  It's a subset of what a full OSD might do, but if
> >> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >> >
> >> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >> >
> >> > will hopefully do the trick.
> >> >
> >> > Christian, maybe you can see if that is able to trigger this warning?
> >> > You'll need to pull it from the current master branch; it wasn't in the
> >> > last release.
> >>
> >> Trying to reproduce with test_filestore_workloadgen didn't work for
> >> me. So here are some instructions on how to reproduce with a minimal
> >> ceph setup.
> >>
> >> You will need a single system with two disks and a bit of memory.
> >>
> >> - Compile and install ceph (detailed instructions:
> >> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> >>
> >> - For the test setup I've used two tmpfs files as journal devices. To
> >> create these, do the following:
> >>
> >> # mkdir -p /ceph/temp
> >> # mount -t tmpfs tmpfs /ceph/temp
> >> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> >> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> >>
> >> - Now you should create and mount btrfs. Here is what I did:
> >>
> >> # mkfs.btrfs -l 64k -n 64k /dev/sda
> >> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> >> # mkdir /ceph/osd.000
> >> # mkdir /ceph/osd.001
> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> >>
> >> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> >> will probably have to change the btrfs devices and the hostname
> >> (os39).
> >>
> >> - Create the ceph filesystems:
> >>
> >> # mkdir /ceph/mon
> >> # mkcephfs -a -c /etc/ceph/ceph.conf
> >>
> >> - Start ceph (e.g. "service ceph start")
> >>
> >> - Now you should be able to use ceph - "ceph -s" will tell you about
> >> the state of the ceph cluster.
> >>
> >> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> >>
> > 
> > It's failing here
> > 
> > http://fpaste.org/e3BG/
> 
> 2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
> osd.1 127.0.0.1:6803/2379 3 ==== osd_op_reply(3 rbd_info [call] = -5
> (Input/output error)) v4 ==== 107+0+0 (3948821281 0 0) 0x7fcb380009a0
> con 0x1cad3e0
> 
> This is probably because the osd isn't finding the rbd class.
> Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
> rbd_cls.so is,
> try adding 'osd class dir = /path/to/rados-classes' to the [osd]
> section
> in your ceph.conf, and restarting the osds.
> 
> If you set 'debug osd = 10' you should see '_load_class rbd' in the osd
> log
> when you try to create an rbd image.
> 
> Autotools should be setting the default location correctly, but if
> you're
> running the osds in a chroot or something the path would be wrong.
> 

Yeah all that was in the right place, I rebooted and I magically stopped getting
that error, but now I'm getting this

http://fpaste.org/OE92/

with that ping thing repeating over and over.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
@ 2012-05-03 16:38               ` Josh Durgin
  0 siblings, 0 replies; 66+ messages in thread
From: Josh Durgin @ 2012-05-03 16:38 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik <josef@redhat.com>
wrote:
> On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
>> On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik <josef@redhat.com>
>> wrote:
>> > On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
>> >> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
>> >> > On Tue, 24 Apr 2012, Josef Bacik wrote:
>> >> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> >> >> > After running ceph on XFS for some time, I decided to try btrfs again.
>> >> >> > Performance with the current "for-linux-min" branch and big metadata
>> >> >> > is much better. The only problem (?) I'm still seeing is a warning
>> >> >> > that seems to occur from time to time:
>> >> >
>> >> > Actually, before you do that... we have a new tool,
>> >> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
>> >> > local file system.  It's a subset of what a full OSD might do, but if
>> >> > we're lucky it will be sufficient to reproduce this issue.  Something like
>> >> >
>> >> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>> >> >
>> >> > will hopefully do the trick.
>> >> >
>> >> > Christian, maybe you can see if that is able to trigger this warning?
>> >> > You'll need to pull it from the current master branch; it wasn't in the
>> >> > last release.
>> >>
>> >> Trying to reproduce with test_filestore_workloadgen didn't work for
>> >> me. So here are some instructions on how to reproduce with a minimal
>> >> ceph setup.
>> >>
>> >> You will need a single system with two disks and a bit of memory.
>> >>
>> >> - Compile and install ceph (detailed instructions:
>> >> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>> >>
>> >> - For the test setup I've used two tmpfs files as journal devices. To
>> >> create these, do the following:
>> >>
>> >> # mkdir -p /ceph/temp
>> >> # mount -t tmpfs tmpfs /ceph/temp
>> >> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
>> >> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>> >>
>> >> - Now you should create and mount btrfs. Here is what I did:
>> >>
>> >> # mkfs.btrfs -l 64k -n 64k /dev/sda
>> >> # mkfs.btrfs -l 64k -n 64k /dev/sdb
>> >> # mkdir /ceph/osd.000
>> >> # mkdir /ceph/osd.001
>> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
>> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>> >>
>> >> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
>> >> will probably have to change the btrfs devices and the hostname
>> >> (os39).
>> >>
>> >> - Create the ceph filesystems:
>> >>
>> >> # mkdir /ceph/mon
>> >> # mkcephfs -a -c /etc/ceph/ceph.conf
>> >>
>> >> - Start ceph (e.g. "service ceph start")
>> >>
>> >> - Now you should be able to use ceph - "ceph -s" will tell you about
>> >> the state of the ceph cluster.
>> >>
>> >> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
>> >>
>> >
>> > It's failing here
>> >
>> > http://fpaste.org/e3BG/
>>
>> 2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
>> osd.1 127.0.0.1:6803/2379 3 ==== osd_op_reply(3 rbd_info [call] = -5
>> (Input/output error)) v4 ==== 107+0+0 (3948821281 0 0) 0x7fcb380009a0
>> con 0x1cad3e0
>>
>> This is probably because the osd isn't finding the rbd class.
>> Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
>> rbd_cls.so is,
>> try adding 'osd class dir = /path/to/rados-classes' to the [osd]
>> section
>> in your ceph.conf, and restarting the osds.
>>
>> If you set 'debug osd = 10' you should see '_load_class rbd' in the osd
>> log
>> when you try to create an rbd image.
>>
>> Autotools should be setting the default location correctly, but if
>> you're
>> running the osds in a chroot or something the path would be wrong.
>>
> 
> Yeah all that was in the right place, I rebooted and I magically
> stopped getting
> that error, but now I'm getting this
> 
> http://fpaste.org/OE92/
> 
> with that ping thing repeating over and over.  Thanks,

That just looks like the osd isn't running. If you restart the
osd with 'debug osd = 20' the osd log should tell us what's going on.
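
Josh's suggestion can be scripted as a quick log check. The log path and the restart command below are assumptions (defaults of the sysvinit packaging of that era) and may differ on your system:

```shell
# Sketch of the log check suggested above.
# Assumption: default osd log location for the first osd.
LOG=/var/log/ceph/osd.0.log

# After setting 'debug osd = 20' in the [osd] section of ceph.conf,
# restart the osd, e.g.:  service ceph restart osd.0

check_class_load() {
    # Look for the class-loading line mentioned earlier in the thread;
    # print the matching line with its line number, or a note if absent.
    grep -n '_load_class rbd' "$1" || echo "no class-load entry in $1"
}

# check_class_load "$LOG"
```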
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
@ 2012-05-03 19:49                 ` Josef Bacik
  0 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-03 19:49 UTC (permalink / raw)
  To: Josh Durgin
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
> On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik <josef@redhat.com>
> wrote:
> > On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> >> On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik <josef@redhat.com>
> >> wrote:
> >> > On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> >> >> Am 24. April 2012 18:26 schrieb Sage Weil <sage@newdream.net>:
> >> >> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> >> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> >> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> >> >> > Performance with the current "for-linux-min" branch and big metadata
> >> >> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> >> >> > that seems to occur from time to time:
> >> >> >
> >> >> > Actually, before you do that... we have a new tool,
> >> >> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> >> >> > local file system.  It's a subset of what a full OSD might do, but if
> >> >> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >> >> >
> >> >> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >> >> >
> >> >> > will hopefully do the trick.
> >> >> >
> >> >> > Christian, maybe you can see if that is able to trigger this warning?
> >> >> > You'll need to pull it from the current master branch; it wasn't in the
> >> >> > last release.
> >> >>
> >> >> Trying to reproduce with test_filestore_workloadgen didn't work for
> >> >> me. So here are some instructions on how to reproduce with a minimal
> >> >> ceph setup.
> >> >>
> >> >> You will need a single system with two disks and a bit of memory.
> >> >>
> >> >> - Compile and install ceph (detailed instructions:
> >> >> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> >> >>
> >> >> - For the test setup I've used two tmpfs files as journal devices. To
> >> >> create these, do the following:
> >> >>
> >> >> # mkdir -p /ceph/temp
> >> >> # mount -t tmpfs tmpfs /ceph/temp
> >> >> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> >> >> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> >> >>
> >> >> - Now you should create and mount btrfs. Here is what I did:
> >> >>
> >> >> # mkfs.btrfs -l 64k -n 64k /dev/sda
> >> >> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> >> >> # mkdir /ceph/osd.000
> >> >> # mkdir /ceph/osd.001
> >> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> >> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> >> >>
> >> >> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> >> >> will probably have to change the btrfs devices and the hostname
> >> >> (os39).
> >> >>
> >> >> - Create the ceph filesystems:
> >> >>
> >> >> # mkdir /ceph/mon
> >> >> # mkcephfs -a -c /etc/ceph/ceph.conf
> >> >>
> >> >> - Start ceph (e.g. "service ceph start")
> >> >>
> >> >> - Now you should be able to use ceph - "ceph -s" will tell you about
> >> >> the state of the ceph cluster.
> >> >>
> >> >> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> >> >>
> >> >
> >> > It's failing here
> >> >
> >> > http://fpaste.org/e3BG/
> >>
> >> 2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
> >> osd.1 127.0.0.1:6803/2379 3 ==== osd_op_reply(3 rbd_info [call] = -5
> >> (Input/output error)) v4 ==== 107+0+0 (3948821281 0 0) 0x7fcb380009a0
> >> con 0x1cad3e0
> >>
> >> This is probably because the osd isn't finding the rbd class.
> >> Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
> >> rbd_cls.so is,
> >> try adding 'osd class dir = /path/to/rados-classes' to the [osd]
> >> section
> >> in your ceph.conf, and restarting the osds.
> >>
> >> If you set 'debug osd = 10' you should see '_load_class rbd' in the osd
> >> log
> >> when you try to create an rbd image.
> >>
> >> Autotools should be setting the default location correctly, but if
> >> you're
> >> running the osds in a chroot or something the path would be wrong.
> >>
> > 
> > Yeah all that was in the right place, I rebooted and I magically
> > stopped getting
> > that error, but now I'm getting this
> > 
> > http://fpaste.org/OE92/
> > 
> > with that ping thing repeating over and over.  Thanks,
> 
> That just looks like the osd isn't running. If you restart the
> osd with 'debug osd = 20' the osd log should tell us what's going on.

Ok, that part was my fault; duh, I need to redo the tmpfs and mkcephfs stuff after
reboot.  But now I'm back to my original problem

http://fpaste.org/PfwO/

I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd is
in there, so I'm not sure what is wrong.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-04 20:24                   ` Christian Brunner
@ 2012-05-09 20:25                     ` Josef Bacik
  -1 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-09 20:25 UTC (permalink / raw)
  To: Christian Brunner
  Cc: Josef Bacik, Josh Durgin, Sage Weil, linux-btrfs, ceph-devel

On Fri, May 04, 2012 at 10:24:16PM +0200, Christian Brunner wrote:
> 2012/5/3 Josef Bacik <josef@redhat.com>:
> > On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
> >> On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik <josef@redhat.com>
> >> wrote:
> >> > On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> >> >
> >> > Yeah all that was in the right place, I rebooted and I magically
> >> > stopped getting
> >> > that error, but now I'm getting this
> >> >
> >> > http://fpaste.org/OE92/
> >> >
> >> > with that ping thing repeating over and over.  Thanks,
> >>
> >> That just looks like the osd isn't running. If you restart the
> >> osd with 'debug osd = 20' the osd log should tell us what's going on.
> >
> > Ok, that part was my fault. Duh, I need to redo the tmpfs and mkcephfs stuff after
> > reboot.  But now I'm back to my original problem
> >
> > http://fpaste.org/PfwO/
> >
> > I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd is
> > in there, so I'm not sure what is wrong.  Thanks,
>
> That's really strange. Do you have the osd logs in /var/log/ceph? If
> so, can you check whether there is anything about "rbd" or "class" loading
> in there?
>
> Another thing you should try is whether you can access ceph with rados:
>
> # rados -p rbd ls
> # rados -p rbd -i /proc/cpuinfo put testobj
> # rados -p rbd -o - get testobj
>

Ok, weirdly, ceph is trying to dlopen /usr/lib64/rados-classes/libcls_rbd.so, but
all I had was libcls_rbd.so.1 and libcls_rbd.so.1.0.0.  A symlink fixed that part;
I'll see if I can reproduce now.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-27 11:02     ` Christian Brunner
@ 2012-05-10 17:40         ` Josef Bacik
  2012-05-10 17:40         ` Josef Bacik
  2012-05-10 20:35         ` Josef Bacik
  2 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-10 17:40 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Sage Weil, Josef Bacik, linux-btrfs, ceph-devel

On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> On 24 April 2012 18:26, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
>
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
>
> You will need a single system with two disks and a bit of memory.
>
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
>
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>
> - Now you should create and mount btrfs. Here is what I did:
>
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
>
> - Create the ceph filesystems:
>
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
>
> - Start ceph (e.g. "service ceph start")
>
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
>
> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
>
> - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> with "./rbdtest testimg".
>
> I can see the first btrfs_orphan_commit_root warning after an hour or
> so... I hope that I've described all necessary steps. If there is a
> problem just send me a note.
>

Well, it's only taken me 2 weeks, but I've finally got it all up and running;
hopefully I'll reproduce.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-04-27 11:02     ` Christian Brunner
@ 2012-05-10 20:35         ` Josef Bacik
  2012-05-10 17:40         ` Josef Bacik
  2012-05-10 20:35         ` Josef Bacik
  2 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-10 20:35 UTC (permalink / raw)
  To: Christian Brunner; +Cc: Sage Weil, Josef Bacik, linux-btrfs, ceph-devel

On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> On 24 April 2012 18:26, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
>
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
>
> You will need a single system with two disks and a bit of memory.
>
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
>
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>
> - Now you should create and mount btrfs. Here is what I did:
>
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
>
> - Create the ceph filesystems:
>
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
>
> - Start ceph (e.g. "service ceph start")
>
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
>
> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
>
> - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> with "./rbdtest testimg".
>
> I can see the first btrfs_orphan_commit_root warning after an hour or
> so... I hope that I've described all necessary steps. If there is a
> problem just send me a note.
>

Well, I feel like an idiot. I finally got it to reproduce, went to look at where I
wanted to put my printks, and there's the problem staring me right in the face.
I've looked seriously at this problem 2 or 3 times and have missed it every
single freaking time.  Here is the patch I'm trying; please try it on yours to
make sure it fixes the problem.  It takes about 2 hours to reproduce for
me, so I won't be able to fully test it until tomorrow, but so far it hasn't
broken anything, so it should be good.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eefe573..4ad628d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -164,6 +161,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a89888..6dd20f3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f849b3..8bbe8c4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1148,7 +1148,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1161,6 +1160,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0218a4e..0265d40 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2138,12 +2138,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2200,8 +2200,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2214,6 +2214,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2261,9 +2262,8 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+	if (BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 0;
 		delete_item = 1;
 	}
 
@@ -2271,7 +2271,6 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
-	spin_unlock(&root->orphan_lock);
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
@@ -2281,6 +2280,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2418,9 +2420,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * add this inode to the orphan list so btrfs_orphan_del does
 		 * the proper thing when we hit it
 		 */
-		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
-		spin_unlock(&root->orphan_lock);
+		atomic_inc(&root->orphan_inodes);
+		BTRFS_I(inode)->has_orphan_item = 1;
 
 		/* if we have links, this was a truncate, lets do that */
 		if (inode->i_nlink) {
@@ -3741,7 +3742,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6921,6 +6922,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
 	ei->complete_ordered = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6934,7 +6936,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	INIT_LIST_HEAD(&ei->ordered_finished);
@@ -6980,13 +6981,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
@ 2012-05-11 13:31           ` Josef Bacik
  0 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-11 13:31 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 10, 2012 at 04:35:23PM -0400, Josef Bacik wrote:
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> > > On 24 April 2012 at 18:26, Sage Weil <sage@newdream.net> wrote:
> > > On Tue, 24 Apr 2012, Josef Bacik wrote:
> > >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > >> > After running ceph on XFS for some time, I decided to try btrfs again.
> > >> > Performance with the current "for-linux-min" branch and big metadata
> > >> > is much better. The only problem (?) I'm still seeing is a warning
> > >> > that seems to occur from time to time:
> > >
> > > Actually, before you do that... we have a new tool,
> > > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > > local file system.  It's a subset of what a full OSD might do, but if
> > > we're lucky it will be sufficient to reproduce this issue.  Something like
> > >
> > >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> > >
> > > will hopefully do the trick.
> > >
> > > Christian, maybe you can see if that is able to trigger this warning?
> > > You'll need to pull it from the current master branch; it wasn't in the
> > > last release.
> > 
> > Trying to reproduce with test_filestore_workloadgen didn't work for
> > me. So here are some instructions on how to reproduce with a minimal
> > ceph setup.
> > 
> > You will need a single system with two disks and a bit of memory.
> > 
> > - Compile and install ceph (detailed instructions:
> > http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> > 
> > - For the test setup I've used two tmpfs files as journal devices. To
> > create these, do the following:
> > 
> > # mkdir -p /ceph/temp
> > # mount -t tmpfs tmpfs /ceph/temp
> > # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> > # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> > 
> > - Now you should create and mount btrfs. Here is what I did:
> > 
> > # mkfs.btrfs -l 64k -n 64k /dev/sda
> > # mkfs.btrfs -l 64k -n 64k /dev/sdb
> > # mkdir /ceph/osd.000
> > # mkdir /ceph/osd.001
> > # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> > # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> > 
> > - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> > will probably have to change the btrfs devices and the hostname
> > (os39).
> > 
> > - Create the ceph filesystems:
> > 
> > # mkdir /ceph/mon
> > # mkcephfs -a -c /etc/ceph/ceph.conf
> > 
> > - Start ceph (e.g. "service ceph start")
> > 
> > - Now you should be able to use ceph - "ceph -s" will tell you about
> > the state of the ceph cluster.
> > 
> > - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> > 
> > - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> > with "./rbdtest testimg".
> > 
> > I can see the first btrfs_orphan_commit_root warning after an hour or
> > so... I hope that I've described all necessary steps. If there is a
> > problem just send me a note.
> > 
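[Editor's note: the quoted reproduction steps can be collapsed into one script. This is only a sketch; the commands, device names, sizes, and /ceph paths are the ones from Christian's mail (mkcephfs and `service ceph start` are the 2012-era tool names) and will need adjusting for your hardware. The wrapper, the dry-run guard, and the /dev/sda and /dev/sdb placeholders are additions. The script is DESTRUCTIVE when given real disks: it reformats both. With no arguments it only prints what it would run.]

```shell
#!/bin/sh
# Sketch of Christian's minimal ceph-on-btrfs reproduction setup.
# Commands are from his mail; the wrapper and dry-run guard are not.
set -u

DISK0=${1:-} DISK1=${2:-}
if [ -n "$DISK0" ] && [ -n "$DISK1" ]; then
    run() { "$@"; }
else
    run() { echo "+ $*"; }          # dry run: show the commands only
    DISK0=/dev/sda DISK1=/dev/sdb   # placeholder disks from the mail
fi

# tmpfs-backed journal files
run mkdir -p /ceph/temp
run mount -t tmpfs tmpfs /ceph/temp
run dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
run dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k

# btrfs with big metadata (64k leaves/nodes) for the two OSDs
run mkfs.btrfs -l 64k -n 64k "$DISK0"
run mkfs.btrfs -l 64k -n 64k "$DISK1"
run mkdir -p /ceph/osd.000 /ceph/osd.001
run mount -o noatime,space_cache,inode_cache,autodefrag "$DISK0" /ceph/osd.000
run mount -o noatime,space_cache,inode_cache,autodefrag "$DISK1" /ceph/osd.001

# create and start the cluster; needs an /etc/ceph/ceph.conf first
run mkdir -p /ceph/mon
run mkcephfs -a -c /etc/ceph/ceph.conf
run service ceph start
run ceph -s
run rbd create -size 100 testimg   # then: gcc -o rbdtest rbdtest.c -lrbd
```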
> 
> Well I feel like an idiot, I finally get it to reproduce, go look at where I
> want to put my printks and there's the problem staring me right in the face.
> I've looked seriously at this problem 2 or 3 times and have missed this every
> single freaking time.  Here is the patch I'm trying, please try it on yours to
> make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
> me so I won't be able to fully test it until tomorrow, but so far it hasn't
> broken anything so it should be good.  Thanks,
> 

That previous patch was against btrfs-next; this patch is against 3.4-rc6 if you
are on mainline.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..54af1fa 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -156,6 +153,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..78ce750 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2195,9 +2196,8 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+	if (BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 0;
 		delete_item = 1;
 	}
 
@@ -2205,7 +2205,6 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
-	spin_unlock(&root->orphan_lock);
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
@@ -2215,6 +2214,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2352,9 +2354,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * add this inode to the orphan list so btrfs_orphan_del does
 		 * the proper thing when we hit it
 		 */
-		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
-		spin_unlock(&root->orphan_lock);
+		atomic_inc(&root->orphan_inodes);
+		BTRFS_I(inode)->has_orphan_item = 1;
 
 		/* if we have links, this was a truncate, lets do that */
 		if (inode->i_nlink) {
@@ -3671,7 +3672,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6914,6 +6915,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6927,7 +6929,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +6973,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
--

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
@ 2012-05-11 13:46           ` Christian Brunner
  0 siblings, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-05-11 13:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Sage Weil, linux-btrfs, ceph-devel

2012/5/10 Josef Bacik <josef@redhat.com>:
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
>> On 24 April 2012 at 18:26, Sage Weil <sage@newdream.net> wrote:
>> > On Tue, 24 Apr 2012, Josef Bacik wrote:
>> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> >> > After running ceph on XFS for some time, I decided to try btrfs again.
>> >> > Performance with the current "for-linux-min" branch and big metadata
>> >> > is much better. The only problem (?) I'm still seeing is a warning
>> >> > that seems to occur from time to time:
>> >
>> > Actually, before you do that... we have a new tool,
>> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
>> > local file system.  It's a subset of what a full OSD might do, but if
>> > we're lucky it will be sufficient to reproduce this issue.  Something like
>> >
>> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>> >
>> > will hopefully do the trick.
>> >
>> > Christian, maybe you can see if that is able to trigger this warning?
>> > You'll need to pull it from the current master branch; it wasn't in the
>> > last release.
>>
>> Trying to reproduce with test_filestore_workloadgen didn't work for
>> me. So here are some instructions on how to reproduce with a minimal
>> ceph setup.
>> [...]
>
> Well I feel like an idiot, I finally get it to reproduce, go look at where I
> want to put my printks and there's the problem staring me right in the face.
> I've looked seriously at this problem 2 or 3 times and have missed this every
> single freaking time.  Here is the patch I'm trying, please try it on yours to
> make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
> me so I won't be able to fully test it until tomorrow, but so far it hasn't
> broken anything so it should be good.  Thanks,

Great! I've put your patch on my testbox and will run a test over the
weekend. I'll report back on monday.

Thanks,
Christian
--

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-11 13:31           ` Josef Bacik
  (?)
@ 2012-05-11 18:33           ` Martin Mailand
  2012-05-11 19:16             ` Josef Bacik
  -1 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-11 18:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

On 11.05.2012 15:31, Josef Bacik wrote:
> That previous patch was against btrfs-next, this patch is against 3.4-rc6 if you
> are on mainline.  Thanks,

I tried your patch against mainline; after a few minutes I hit this bug.

[ 1078.523655] ------------[ cut here ]------------
[ 1078.523667] kernel BUG at fs/btrfs/inode.c:2211!
[ 1078.523676] invalid opcode: 0000 [#1] SMP
[ 1078.523692] CPU 5
[ 1078.523696] Modules linked in: btrfs zlib_deflate libcrc32c mlx4_en bonding ext2 coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core mei(C) joydev ses ioatdma enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid igb megaraid_sas mlx4_core dca
[ 1078.523813]
[ 1078.523818] Pid: 4108, comm: ceph-osd Tainted: G         C 3.4.0-rc6+ #5 Supermicro X9SRi/X9SRi
[ 1078.523841] RIP: 0010:[<ffffffffa022b2a2>]  [<ffffffffa022b2a2>] btrfs_orphan_del+0xb2/0xc0 [btrfs]
[ 1078.523867] RSP: 0018:ffff880ff14a5d38  EFLAGS: 00010282
[ 1078.523877] RAX: 00000000fffffffe RBX: ffff880ff004d6f0 RCX: 0000000000117400
[ 1078.523891] RDX: 00000000001173ff RSI: ffff8810279f6ea0 RDI: ffffea00409e7d80
[ 1078.523905] RBP: ffff880ff14a5d58 R08: 000060ef80001400 R09: ffffffffa0202c6a
[ 1078.523918] R10: 0000000000000000 R11: 00000000000000ba R12: 0000000000000001
[ 1078.523932] R13: ffff881017663c00 R14: 0000000000000001 R15: ffff88101776f5a0
[ 1078.523946] FS:  00007f1d2c03c700(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[ 1078.523961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1078.523990] CR2: 00000000050f4000 CR3: 0000000ff2a57000 CR4: 00000000000407e0
[ 1078.524019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1078.524048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1078.524077] Process ceph-osd (pid: 4108, threadinfo ffff880ff14a4000, task ffff880ff2aa44a0)
[ 1078.524121] Stack:
[ 1078.524141]  ffff8810279f7460 0000000000000000 ffff881017663c00 ffff880ff004d6f0
[ 1078.524190]  ffff880ff14a5e08 ffffffffa022f5d8 ffff880ff004d6f0 0000000000000000
[ 1078.524240]  ffff880ff14a5e18 ffffffff81188afd 0000800000000000 0000800000001000
[ 1078.524289] Call Trace:
[ 1078.524317]  [<ffffffffa022f5d8>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 1078.524348]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 1078.524380]  [<ffffffffa0230f91>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 1078.524408]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 1078.524435]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 1078.524461]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 1078.524489]  [<ffffffff8165fd69>] system_call_fastpath+0x16/0x1b
[ 1078.524516] Code: 8b 65 e8 4c 8b 6d f0 4c 8b 75 f8 c9 c3 0f 1f 40 00 80 bb 60 fe ff ff 84 75 c1 eb bb 0f 1f 44 00 00 48 89 df e8 a0 73 fe ff eb c1 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 1078.524710] RIP  [<ffffffffa022b2a2>] btrfs_orphan_del+0xb2/0xc0 [btrfs]
[ 1078.524744]  RSP <ffff880ff14a5d38>
[ 1078.525013] ---[ end trace 88c92720204f7aa4 ]---


That's the drive with the broken btrfs.

[  212.843776] device fsid 28492275-01d3-4e89-9f1c-bd86057194bf devid 1 transid 4 /dev/sdc
[  212.844630] btrfs: setting nodatacow
[  212.844637] btrfs: enabling auto defrag
[  212.844640] btrfs: disk space caching is enabled
[  212.844643] btrfs flagging fs with big metadata feature



-martin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-11 18:33           ` Martin Mailand
@ 2012-05-11 19:16             ` Josef Bacik
  2012-05-14 14:19               ` Martin Mailand
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-11 19:16 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Fri, May 11, 2012 at 08:33:34PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> On 11.05.2012 15:31, Josef Bacik wrote:
> >That previous patch was against btrfs-next, this patch is against 3.4-rc6 if you
> >are on mainline.  Thanks,
> 
> I tried your patch against mainline, after a few minutes I hit this bug.
> 

Heh duh, sorry, try this one instead.  Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..54af1fa 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -156,6 +153,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..5ba68d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2195,9 +2196,13 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+	if (BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 0;
 		delete_item = 1;
 	}
 
@@ -2215,6 +2220,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2352,9 +2360,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * add this inode to the orphan list so btrfs_orphan_del does
 		 * the proper thing when we hit it
 		 */
-		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
-		spin_unlock(&root->orphan_lock);
+		atomic_inc(&root->orphan_inodes);
+		BTRFS_I(inode)->has_orphan_item = 1;
 
 		/* if we have links, this was a truncate, lets do that */
 		if (inode->i_nlink) {
@@ -3671,7 +3678,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6914,6 +6921,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6927,7 +6935,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +6979,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-11 19:16             ` Josef Bacik
@ 2012-05-14 14:19               ` Martin Mailand
  2012-05-14 14:20                 ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-14 14:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

Am 11.05.2012 21:16, schrieb Josef Bacik:
> Heh duh, sorry, try this one instead.  Thanks,

With this patch I got this Bug:

[ 8233.828722] ------------[ cut here ]------------
[ 8233.828737] kernel BUG at fs/btrfs/inode.c:2217!
[ 8233.828746] invalid opcode: 0000 [#1] SMP
[ 8233.828761] CPU 1
[ 8233.828761] Modules linked in: btrfs zlib_deflate libcrc32c ses 
enclosure bonding coretemp ghash_clmulni_intel psmouse aesni_intel 
sb_edac cryptd aes_x86_64 ext2 microcode serio_raw edac_core mei(C) 
joydev ioatdma mac_hid lp parport usbhid hid isci libsas ixgbe 
scsi_transport_sas megaraid_sas igb dca mdio
[ 8233.828885]
[ 8233.828891] Pid: 4444, comm: ceph-osd Tainted: G        WC 
3.4.0-rc6+ #6 Supermicro X9SRi/X9SRi
[ 8233.828915] RIP: 0010:[<ffffffffa02492d2>]  [<ffffffffa02492d2>] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 8233.828947] RSP: 0018:ffff88101ce53d18  EFLAGS: 00010282
[ 8233.828957] RAX: 00000000fffffffe RBX: ffff880d194e2c50 RCX: 
0000000000d0a3be
[ 8233.828971] RDX: 0000000000d0a3bd RSI: ffff88101de2a000 RDI: 
ffffea0040778a80
[ 8233.828985] RBP: ffff88101ce53d58 R08: 000060ef80000f00 R09: 
ffffffffa0220c6a
[ 8233.828999] R10: 0000000000000000 R11: 00000000000000f0 R12: 
ffff88071bb1e790
[ 8233.829029] R13: ffff88071bb1e400 R14: 0000000000000001 R15: 
0000000000000001
[ 8233.829059] FS:  00007fdfa179b700(0000) GS:ffff88107fc20000(0000) 
knlGS:0000000000000000
[ 8233.829104] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8233.829131] CR2: 000000000c614000 CR3: 00000001df9d2000 CR4: 
00000000000407e0
[ 8233.829160] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 8233.829190] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 8233.829220] Process ceph-osd (pid: 4444, threadinfo ffff88101ce52000, 
task ffff88101b7b96e0)
[ 8233.829265] Stack:
[ 8233.829286]  0c00000000000002 ffff88101de14cd0 ffff88101ce53d38 
ffff88101de14cd0
[ 8233.829336]  0000000000000000 ffff88071bb1e400 ffff880d194e2c50 
ffff881024680620
[ 8233.829386]  ffff88101ce53e08 ffffffffa024d608 ffff880d194e2c50 
0000000000000000
[ 8233.829436] Call Trace:
[ 8233.829472]  [<ffffffffa024d608>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 8233.829503]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 8233.829537]  [<ffffffffa024efc1>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 8233.829567]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 8233.829595]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 8233.829621]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 8233.829649]  [<ffffffff8165fd69>] system_call_fastpath+0x16/0x1b
[ 8233.829676] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 70 73 fe ff 
eb b8 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 8233.829875] RIP  [<ffffffffa02492d2>] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 8233.829914]  RSP <ffff88101ce53d18>
[ 8233.830187] ---[ end trace 46dd4a711bf2979d ]---


-martin


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-14 14:19               ` Martin Mailand
@ 2012-05-14 14:20                 ` Josef Bacik
  2012-05-16 19:20                   ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-14 14:20 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Mon, May 14, 2012 at 04:19:37PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> Am 11.05.2012 21:16, schrieb Josef Bacik:
> >Heh duh, sorry, try this one instead.  Thanks,
> 
> With this patch I got this Bug:

Yeah Christian reported the same thing on Friday.  I'm going to work on a patch
and actually run it here to make sure it doesn't blow up and then send it to the
list when I think I've got something that works.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-14 14:20                 ` Josef Bacik
@ 2012-05-16 19:20                   ` Josef Bacik
  2012-05-17 10:29                     ` Martin Mailand
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-16 19:20 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Martin Mailand, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Mon, May 14, 2012 at 10:20:48AM -0400, Josef Bacik wrote:
> On Mon, May 14, 2012 at 04:19:37PM +0200, Martin Mailand wrote:
> > Hi Josef,
> > 
> > Am 11.05.2012 21:16, schrieb Josef Bacik:
> > >Heh duh, sorry, try this one instead.  Thanks,
> > 
> > With this patch I got this Bug:
> 
> Yeah Christian reported the same thing on Friday.  I'm going to work on a patch
> and actually run it here to make sure it doesn't blow up and then send it to the
> list when I think I've got something that works.  Thanks,
> 

Hrm ok so I finally got some time to try and debug it and let the test run a
good long while (almost 5 hours) and I couldn't hit either the original bug or
the one you guys were hitting.  So either my extra little bit of locking did the
trick or I get to keep my "Worst reproducer ever" award.  Can you guys give this
one a whirl, and if it panics send the entire dmesg, since it should spit out a
WARN_ON() telling me whether what I thought was the problem really was it.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..c0cff20 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2166,8 +2166,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2180,6 +2180,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2198,6 +2199,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&root->orphan_lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&root->orphan_lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2227,9 +2231,13 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+	if (BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 0;
 		delete_item = 1;
 	}
 
@@ -2247,6 +2255,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2385,7 +2396,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * the proper thing when we hit it
 		 */
 		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+		atomic_inc(&root->orphan_inodes);
+		WARN_ON(BTRFS_I(inode)->has_orphan_item);
+		BTRFS_I(inode)->has_orphan_item = 1;
 		spin_unlock(&root->orphan_lock);
 
 		/* if we have links, this was a truncate, lets do that */
@@ -3707,7 +3720,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6866,6 +6879,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6879,7 +6893,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6924,13 +6937,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-16 19:20                   ` Josef Bacik
@ 2012-05-17 10:29                     ` Martin Mailand
  2012-05-17 14:43                       ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-17 10:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

somehow I still get the kernel BUG messages; I used your patch from the 
16th against rc7.

-martin

Am 16.05.2012 21:20, schrieb Josef Bacik:
> Hrm ok so I finally got some time to try and debug it and let the test run a
> good long while (almost 5 hours) and I couldn't hit either the original bug or
> the one you guys were hitting.  So either my extra little bit of locking did the
> trick or I get to keep my "Worst reproducer ever" award.  Can you guys give this
> one a whirl, and if it panics send the entire dmesg, since it should spit out a
> WARN_ON() telling me whether what I thought was the problem really was it.  Thanks,

[ 2868.813236] ------------[ cut here ]------------
[ 2868.813297] kernel BUG at fs/btrfs/inode.c:2220!
[ 2868.813355] invalid opcode: 0000 [#2] SMP
[ 2868.813479] CPU 2
[ 2868.813516] Modules linked in: btrfs zlib_deflate libcrc32c ext2 
bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 
microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma 
enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb megaraid_sas dca mdio
[ 2868.814871]
[ 2868.814925] Pid: 5325, comm: ceph-osd Tainted: G      D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 2868.815108] RIP: 0010:[<ffffffffa02212f2>]  [<ffffffffa02212f2>] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 2868.815236] RSP: 0018:ffff880296e89d18  EFLAGS: 00010282
[ 2868.815294] RAX: 00000000fffffffe RBX: ffff88101ef3c390 RCX: 
0000000000562497
[ 2868.815355] RDX: 0000000000562496 RSI: ffff88101ef10000 RDI: 
ffffea00407bc400
[ 2868.815416] RBP: ffff880296e89d58 R08: 000060ef80000fd0 R09: 
ffffffffa01f8c6a
[ 2868.815476] R10: 0000000000000000 R11: 000000000000011d R12: 
ffff880fdf602790
[ 2868.815537] R13: ffff880fdf602400 R14: 0000000000000001 R15: 
0000000000000001
[ 2868.815598] FS:  00007f07d5512700(0000) GS:ffff88107fc40000(0000) 
knlGS:0000000000000000
[ 2868.815675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2868.815734] CR2: 000000000ab16000 CR3: 000000082a6b2000 CR4: 
00000000000407e0
[ 2868.815796] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 2868.815858] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 2868.815920] Process ceph-osd (pid: 5325, threadinfo ffff880296e88000, 
task ffff8810170616e0)
[ 2868.815997] Stack:
[ 2868.816049]  0c00000000000007 ffff88101ef12960 ffff880296e89d38 
ffff88101ef12960
[ 2868.816262]  0000000000000000 ffff880fdf602400 ffff88101ef3c390 
ffff880b4ce2f260
[ 2868.816485]  ffff880296e89e08 ffffffffa0225628 ffff88101ef3c390 
0000000000000000
[ 2868.816694] Call Trace:
[ 2868.816755]  [<ffffffffa0225628>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 2868.816817]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 2868.816880]  [<ffffffffa0227021>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 2868.816940]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 2868.816998]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 2868.817056]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 2868.817115]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[ 2868.817173] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 2868.819501] RIP  [<ffffffffa02212f2>] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 2868.819602]  RSP <ffff880296e89d18>
[ 2868.819703] ---[ end trace 94d17b770b376c84 ]---
[ 3249.857453] ------------[ cut here ]------------
[ 3249.857481] kernel BUG at fs/btrfs/inode.c:2220!
[ 3249.857506] invalid opcode: 0000 [#3] SMP
[ 3249.857534] CPU 0
[ 3249.857538] Modules linked in: btrfs zlib_deflate libcrc32c ext2 
bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 
microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma 
enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb megaraid_sas dca mdio
[ 3249.857721]
[ 3249.857740] Pid: 5384, comm: ceph-osd Tainted: G      D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 3249.857791] RIP: 0010:[<ffffffffa02212f2>]  [<ffffffffa02212f2>] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 3249.857847] RSP: 0018:ffff880abe8b5d18  EFLAGS: 00010282
[ 3249.857873] RAX: 00000000fffffffe RBX: ffff8807eb8b6670 RCX: 
000000000077a084
[ 3249.857902] RDX: 000000000077a083 RSI: ffff88101ee497e0 RDI: 
ffffea00407b9240
[ 3249.857931] RBP: ffff880abe8b5d58 R08: 000060ef80000fd0 R09: 
ffffffffa01f8c6a
[ 3249.857959] R10: 0000000000000000 R11: 0000000000000153 R12: 
ffff880d56825390
[ 3249.857988] R13: ffff880d56825000 R14: 0000000000000001 R15: 
0000000000000001
[ 3249.858017] FS:  00007f06bd13b700(0000) GS:ffff88107fc00000(0000) 
knlGS:0000000000000000
[ 3249.858062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3249.858088] CR2: 00000000043d2000 CR3: 0000000e7ebe5000 CR4: 
00000000000407f0
[ 3249.858117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 3249.858146] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 3249.858175] Process ceph-osd (pid: 5384, threadinfo ffff880abe8b4000, 
task ffff880eb7a596e0)
[ 3249.858219] Stack:
[ 3249.858239]  0c00000000000002 ffff88101ede4d70 ffff880abe8b5d38 
ffff88101ede4d70
[ 3249.858288]  0000000000000000 ffff880d56825000 ffff8807eb8b6670 
ffff880546925e00
[ 3249.858338]  ffff880abe8b5e08 ffffffffa0225628 ffff8807eb8b6670 
0000000000000000
[ 3249.858387] Call Trace:
[ 3249.858415]  [<ffffffffa0225628>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 3249.858445]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 3249.858477]  [<ffffffffa0227021>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 3249.858505]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 3249.858533]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 3249.858559]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 3249.858587]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[ 3249.858614] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 3249.858813] RIP  [<ffffffffa02212f2>] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 3249.858852]  RSP <ffff880abe8b5d18>
[ 3249.859140] ---[ end trace 94d17b770b376c85 ]---

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-17 10:29                     ` Martin Mailand
@ 2012-05-17 14:43                       ` Josef Bacik
  2012-05-17 15:12                         ` Martin Mailand
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-17 14:43 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 17, 2012 at 12:29:32PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> somehow I still get the kernel Bug messages, I used your patch from
> the 16th against rc7.
> 

Was there anything above those messages?  There should have been a WARN_ON() or
something.  If not, that's fine; I just need to know one way or the other so I can
figure out what to do next.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-17 14:43                       ` Josef Bacik
@ 2012-05-17 15:12                         ` Martin Mailand
  2012-05-17 19:43                           ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-17 15:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,
No, there was nothing above. Here is another dmesg output.

> Was there anything above those messages?  There should have been a WARN_ON() or
> something.  If not, that's fine; I just need to know one way or the other so I can
> figure out what to do next.  Thanks,
>
> Josef

-martin

[   63.027277] Btrfs loaded
[   63.027485] device fsid 266726e1-439f-4d89-a374-7ef92d355daf devid 1 
transid 4 /dev/sdc
[   63.027750] btrfs: setting nodatacow
[   63.027752] btrfs: enabling auto defrag
[   63.027753] btrfs: disk space caching is enabled
[   63.027754] btrfs flagging fs with big metadata feature
[   63.036347] device fsid 070e2c6c-2ea5-478d-bc07-7ce3a954e2e4 devid 1 
transid 4 /dev/sdd
[   63.036624] btrfs: setting nodatacow
[   63.036626] btrfs: enabling auto defrag
[   63.036627] btrfs: disk space caching is enabled
[   63.036628] btrfs flagging fs with big metadata feature
[   63.045628] device fsid 6f7b82a9-a1b7-40c6-8b00-2c2a44481066 devid 1 
transid 4 /dev/sde
[   63.045910] btrfs: setting nodatacow
[   63.045912] btrfs: enabling auto defrag
[   63.045913] btrfs: disk space caching is enabled
[   63.045914] btrfs flagging fs with big metadata feature
[   63.831278] device fsid 46890b76-45c2-4ea2-96ee-2ea88e29628b devid 1 
transid 4 /dev/sdf
[   63.831577] btrfs: setting nodatacow
[   63.831579] btrfs: enabling auto defrag
[   63.831579] btrfs: disk space caching is enabled
[   63.831580] btrfs flagging fs with big metadata feature
[ 1521.820412] ------------[ cut here ]------------
[ 1521.820424] kernel BUG at fs/btrfs/inode.c:2220!
[ 1521.820433] invalid opcode: 0000 [#1] SMP
[ 1521.820448] CPU 4
[ 1521.820452] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses 
enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd 
aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev 
ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb dca megaraid_sas mdio
[ 1521.820562]
[ 1521.820567] Pid: 3095, comm: ceph-osd Tainted: G         C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 1521.820591] RIP: 0010:[<ffffffffa02532f2>]  [<ffffffffa02532f2>] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 1521.820616] RSP: 0018:ffff881013da9d18  EFLAGS: 00010282
[ 1521.820626] RAX: 00000000fffffffe RBX: ffff881013a3b7f0 RCX: 
0000000000395dcf
[ 1521.820640] RDX: 0000000000395dce RSI: ffff88101df77480 RDI: 
ffffea004077ddc0
[ 1521.820654] RBP: ffff881013da9d58 R08: 000060ef800010d0 R09: 
ffffffffa022ac6a
[ 1521.820667] R10: 0000000000000000 R11: 000000000000010a R12: 
ffff88101e378790
[ 1521.820681] R13: ffff88101e378400 R14: 0000000000000001 R15: 
0000000000000001
[ 1521.820695] FS:  00007faa45d30700(0000) GS:ffff88107fc80000(0000) 
knlGS:0000000000000000
[ 1521.820710] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1521.820738] CR2: 00007fe0efba6010 CR3: 0000001016fec000 CR4: 
00000000000407e0
[ 1521.820767] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 1521.820796] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 1521.820825] Process ceph-osd (pid: 3095, threadinfo ffff881013da8000, 
task ffff881013da44a0)
[ 1521.820870] Stack:
[ 1521.820889]  0c00000000000005 ffff88101df9c230 ffff881013da9d38 
ffff88101df9c230
[ 1521.820939]  0000000000000000 ffff88101e378400 ffff881013a3b7f0 
ffff880c6880f840
[ 1521.820988]  ffff881013da9e08 ffffffffa0257628 ffff881013a3b7f0 
0000000000000000
[ 1521.821038] Call Trace:
[ 1521.821066]  [<ffffffffa0257628>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 1521.821096]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 1521.821128]  [<ffffffffa0259021>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 1521.821156]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 1521.821183]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 1521.821209]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 1521.821237]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[ 1521.821265] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 1521.821458] RIP  [<ffffffffa02532f2>] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 1521.821492]  RSP <ffff881013da9d18>
[ 1521.821758] ---[ end trace aee4c5fe92ee2a67 ]---
[ 6888.637508] btrfs: truncated 1 orphans
[ 7641.701736] ------------[ cut here ]------------
[ 7641.701764] kernel BUG at fs/btrfs/inode.c:2220!
[ 7641.701789] invalid opcode: 0000 [#2] SMP
[ 7641.701816] CPU 3
[ 7641.701819] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses 
enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd 
aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev 
ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb dca megaraid_sas mdio
[ 7641.702000]
[ 7641.702030] Pid: 3064, comm: ceph-osd Tainted: G      D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 7641.702081] RIP: 0010:[<ffffffffa02532f2>]  [<ffffffffa02532f2>] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 7641.702140] RSP: 0018:ffff881013c51d18  EFLAGS: 00010282
[ 7641.702166] RAX: 00000000fffffffe RBX: ffff881010871130 RCX: 
00000000013df293
[ 7641.702195] RDX: 00000000013df292 RSI: ffff88101701c1b0 RDI: 
ffffea00405c0700
[ 7641.702224] RBP: ffff881013c51d58 R08: 000060ef800010d0 R09: 
ffffffffa022ac6a
[ 7641.702253] R10: 0000000000000000 R11: 000000000000013f R12: 
ffff88101e379390
[ 7641.702282] R13: ffff88101e379000 R14: 0000000000000001 R15: 
0000000000000001
[ 7641.702311] FS:  00007fcb27307700(0000) GS:ffff88107fc60000(0000) 
knlGS:0000000000000000
[ 7641.702368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7641.702395] CR2: 0000000010713018 CR3: 0000001016e95000 CR4: 
00000000000407e0
[ 7641.702425] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 7641.702454] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 7641.702484] Process ceph-osd (pid: 3064, threadinfo ffff881013c50000, 
task ffff881015e35b80)
[ 7641.702529] Stack:
[ 7641.702557]  0c00000000000004 ffff88101701d820 ffff881013c51d38 
ffff88101701d820
[ 7641.702618]  0000000000000000 ffff88101e379000 ffff881010871130 
ffff880503e70b80
[ 7641.702678]  ffff881013c51e08 ffffffffa0257628 ffff881010871130 
0000000000000000
[ 7641.702729] Call Trace:
[ 7641.702761]  [<ffffffffa0257628>] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 7641.702792]  [<ffffffff81188afd>] ? path_lookupat+0x6d/0x750
[ 7641.702828]  [<ffffffffa0259021>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 7641.702858]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 7641.702886]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 7641.702913]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 7641.702942]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[ 7641.702969] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec
[ 7641.703185] RIP  [<ffffffffa02532f2>] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 7641.703224]  RSP <ffff881013c51d18>
[ 7641.703591] ---[ end trace aee4c5fe92ee2a68 ]---


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-17 15:12                         ` Martin Mailand
@ 2012-05-17 19:43                           ` Josef Bacik
  2012-05-17 20:54                             ` Christian Brunner
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-17 19:43 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
> Hi Josef,
> no, there was nothing above. Here is another dmesg output.
> 

Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..7cc1c96 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2166,8 +2166,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2180,6 +2180,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2198,6 +2199,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&root->orphan_lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&root->orphan_lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2227,13 +2231,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
-		delete_item = 1;
+	if (BTRFS_I(inode)->has_orphan_item) {
+		if (trans) {
+			BTRFS_I(inode)->has_orphan_item = 0;
+			delete_item = 1;
+		} else {
+			WARN_ON(1);
+		}
 	}
 
-	if (BTRFS_I(inode)->orphan_meta_reserved) {
+	if (trans && BTRFS_I(inode)->orphan_meta_reserved) {
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
@@ -2247,6 +2259,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2385,7 +2400,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * the proper thing when we hit it
 		 */
 		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+		atomic_inc(&root->orphan_inodes);
+		WARN_ON(BTRFS_I(inode)->has_orphan_item);
+		BTRFS_I(inode)->has_orphan_item = 1;
 		spin_unlock(&root->orphan_lock);
 
 		/* if we have links, this was a truncate, lets do that */
@@ -3707,7 +3724,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6866,6 +6883,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6879,7 +6897,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6924,13 +6941,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);


* Re: Ceph on btrfs 3.4rc
  2012-05-17 19:43                           ` Josef Bacik
@ 2012-05-17 20:54                             ` Christian Brunner
  2012-05-17 21:18                               ` Martin Mailand
  0 siblings, 1 reply; 66+ messages in thread
From: Christian Brunner @ 2012-05-17 20:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

2012/5/17 Josef Bacik <josef@redhat.com>:
> On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
>> Hi Josef,
>> no, there was nothing above. Here is another dmesg output.
>>
>
> Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
> Thanks,
>
> Josef

Well, I hate to say it, but the new patch doesn't seem to change much...

Regards,
Christian

[  123.507444] Btrfs loaded
[  202.683630] device fsid 2aa7531c-0e3c-4955-8542-6aed7ab8c1a2 devid 1 transid 4 /dev/sda
[  202.693704] btrfs: use lzo compression
[  202.697999] btrfs: enabling inode map caching
[  202.702989] btrfs: enabling auto defrag
[  202.707190] btrfs: disk space caching is enabled
[  202.712721] btrfs flagging fs with big metadata feature
[  207.839761] device fsid f81ff6a1-c333-4daf-989f-a28139f15f08 devid 1 transid 4 /dev/sdb
[  207.849681] btrfs: use lzo compression
[  207.853987] btrfs: enabling inode map caching
[  207.858970] btrfs: enabling auto defrag
[  207.863173] btrfs: disk space caching is enabled
[  207.868635] btrfs flagging fs with big metadata feature
[  210.857328] device fsid 9b905faa-f4fa-4626-9cae-2cd0287b30f7 devid 1 transid 4 /dev/sdc
[  210.867265] btrfs: use lzo compression
[  210.871560] btrfs: enabling inode map caching
[  210.876550] btrfs: enabling auto defrag
[  210.880757] btrfs: disk space caching is enabled
[  210.886228] btrfs flagging fs with big metadata feature
[  214.296287] device fsid f7990e4c-90b0-4691-9502-92b60538574a devid 1 transid 4 /dev/sdd
[  214.306510] btrfs: use lzo compression
[  214.310855] btrfs: enabling inode map caching
[  214.315905] btrfs: enabling auto defrag
[  214.320174] btrfs: disk space caching is enabled
[  214.325706] btrfs flagging fs with big metadata feature
[ 1337.937379] ------------[ cut here ]------------
[ 1337.942526] kernel BUG at fs/btrfs/inode.c:2224!
[ 1337.947671] invalid opcode: 0000 [#1] SMP
[ 1337.952255] CPU 5
[ 1337.954300] Modules linked in: btrfs zlib_deflate libcrc32c xfs exportfs sunrpc bonding ipv6 sg pcspkr serio_raw iTCO_wdt iTCO_vendor_support iomemory_vsl(PO) ixgbe dca mdio i7core_edac edac_core hpsa squashfs [last unloaded: scsi_wait_scan]
[ 1337.978570]
[ 1337.980230] Pid: 6812, comm: ceph-osd Tainted: P           O 3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[ 1337.991592] RIP: 0010:[<ffffffffa035675c>]  [<ffffffffa035675c>] btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.001897] RSP: 0018:ffff8805e1171d38  EFLAGS: 00010282
[ 1338.007815] RAX: 00000000fffffffe RBX: ffff88061c3c8400 RCX: 0000000000b37f48
[ 1338.015768] RDX: 0000000000b37f47 RSI: ffff8805ec2a1cf0 RDI: ffffea0017b0a840
[ 1338.023724] RBP: ffff8805e1171d68 R08: 000060f9d88028a0 R09: ffffffffa033016a
[ 1338.031675] R10: 0000000000000000 R11: 0000000000000004 R12: ffff8805de7f57a0
[ 1338.039629] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8805ec2a5280
[ 1338.047584] FS:  00007f4bffc6e700(0000) GS:ffff8806272a0000(0000) knlGS:0000000000000000
[ 1338.056600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1338.063003] CR2: ffffffffff600400 CR3: 00000005e34c3000 CR4: 00000000000006e0
[ 1338.070954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1338.078909] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1338.086865] Process ceph-osd (pid: 6812, threadinfo ffff8805e1170000, task ffff88060fa81940)
[ 1338.096268] Stack:
[ 1338.098509]  ffff8805e1171d68 ffff8805ec2a5280 ffff88051235b920 0000000000000000
[ 1338.106795]  ffff88051235b920 0000000000080000 ffff8805e1171e08 ffffffffa036043c
[ 1338.115082]  0000000000000000 0000000000000000 0000000000000000 0001000000001000
[ 1338.123367] Call Trace:
[ 1338.126111]  [<ffffffffa036043c>] btrfs_truncate+0x5bc/0x640 [btrfs]
[ 1338.133213]  [<ffffffffa03605b6>] btrfs_setattr+0xf6/0x1a0 [btrfs]
[ 1338.140105]  [<ffffffff811816fb>] notify_change+0x18b/0x2b0
[ 1338.146320]  [<ffffffff81276541>] ? selinux_inode_permission+0xd1/0x130
[ 1338.153699]  [<ffffffff81165f44>] do_truncate+0x64/0xa0
[ 1338.159527]  [<ffffffff81172669>] ? inode_permission+0x49/0x100
[ 1338.166128]  [<ffffffff81166197>] sys_truncate+0x137/0x150
[ 1338.172244]  [<ffffffff8158b1e9>] system_call_fastpath+0x16/0x1b
[ 1338.178936] Code: 89 e7 e8 88 7d fe ff eb 89 66 0f 1f 44 00 00 be a4 08 00 00 48 c7 c7 59 49 3b a0 45 31 ed e8 5c 78 cf e0 45 31 f6 e9 30 ff ff ff <0f> 0b eb fe 55 48 89 e5 48 83 ec 40 48 89 5d d8 4c 89 65 e0 4c
[ 1338.200623] RIP  [<ffffffffa035675c>] btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.208317]  RSP <ffff8805e1171d38>
[ 1338.212681] ---[ end trace 86be14f0f863ea79 ]---


* Re: Ceph on btrfs 3.4rc
  2012-05-17 20:54                             ` Christian Brunner
@ 2012-05-17 21:18                               ` Martin Mailand
  2012-05-18 14:48                                 ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-17 21:18 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

I hit exactly the same bug as Christian with your last patch.

-martin


* Re: Ceph on btrfs 3.4rc
  2012-05-17 21:18                               ` Martin Mailand
@ 2012-05-18 14:48                                 ` Josef Bacik
  2012-05-18 17:24                                   ` Martin Mailand
  0 siblings, 1 reply; 66+ messages in thread
From: Josef Bacik @ 2012-05-18 14:48 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Thu, May 17, 2012 at 11:18:25PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> I hit exactly the same bug as Christian with your last patch.
> 

Ok hopefully this will print something out that makes sense.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..492c74f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -156,6 +153,8 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
+	unsigned doing_truncate:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..7de7f6f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2166,6 +2167,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&root->orphan_lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&root->orphan_lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2195,13 +2199,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
-		delete_item = 1;
+	if (BTRFS_I(inode)->has_orphan_item) {
+		if (trans) {
+			BTRFS_I(inode)->has_orphan_item = 0;
+			delete_item = 1;
+		} else {
+			WARN_ON(1);
+		}
 	}
 
-	if (BTRFS_I(inode)->orphan_meta_reserved) {
+	if (trans && BTRFS_I(inode)->orphan_meta_reserved) {
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
@@ -2209,12 +2221,18 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+		if (ret)
+			printk(KERN_ERR "couldn't find orphan item for %Lu\n",
+			       btrfs_ino(inode));
 		BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
 	}
 
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2341,6 +2359,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 				ret = PTR_ERR(trans);
 				goto out;
 			}
+			printk(KERN_ERR "auto deleting %Lu\n",
+			       found_key.objectid);
 			ret = btrfs_del_orphan_item(trans, root,
 						    found_key.objectid);
 			BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
@@ -2353,7 +2373,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * the proper thing when we hit it
 		 */
 		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+		atomic_inc(&root->orphan_inodes);
+		WARN_ON(BTRFS_I(inode)->has_orphan_item);
+		BTRFS_I(inode)->has_orphan_item = 1;
 		spin_unlock(&root->orphan_lock);
 
 		/* if we have links, this was a truncate, lets do that */
@@ -3671,7 +3693,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6683,9 +6705,13 @@ static int btrfs_truncate(struct inode *inode)
 	u64 mask = root->sectorsize - 1;
 	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
 
+	spin_lock(&BTRFS_I(inode)->lock);
+	BUG_ON(BTRFS_I(inode)->doing_truncate);
+	BTRFS_I(inode)->doing_truncate = 1;
+	spin_unlock(&BTRFS_I(inode)->lock);
 	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
 	if (ret)
-		return ret;
+		goto real_out;
 
 	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
 	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
@@ -6727,8 +6753,10 @@ static int btrfs_truncate(struct inode *inode)
 	 * updating the inode.
 	 */
 	rsv = btrfs_alloc_block_rsv(root);
-	if (!rsv)
-		return -ENOMEM;
+	if (!rsv) {
+		ret = -ENOMEM;
+		goto real_out;
+	}
 	rsv->size = min_size;
 
 	/*
@@ -6847,7 +6875,10 @@ end_trans:
 
 out:
 	btrfs_free_block_rsv(root, rsv);
-
+real_out:
+	spin_lock(&BTRFS_I(inode)->lock);
+	BTRFS_I(inode)->doing_truncate = 0;
+	spin_unlock(&BTRFS_I(inode)->lock);
 	if (ret && !err)
 		err = ret;
 
@@ -6914,6 +6945,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
+	ei->doing_truncate = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6927,7 +6960,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +7004,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);


* Re: Ceph on btrfs 3.4rc
  2012-05-18 14:48                                 ` Josef Bacik
@ 2012-05-18 17:24                                   ` Martin Mailand
  2012-05-18 19:01                                     ` Josef Bacik
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-18 17:24 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,
there was one line before the bug.

[  995.725105] couldn't find orphan item for 524


On 18.05.2012 16:48, Josef Bacik wrote:
> Ok hopefully this will print something out that makes sense.  Thanks,

-martin

[  241.754693] Btrfs loaded
[  241.755148] device fsid 43c4ebd9-3824-4b07-a710-3ec39b012759 devid 1 transid 4 /dev/sdc
[  241.755750] btrfs: setting nodatacow
[  241.755753] btrfs: enabling auto defrag
[  241.755754] btrfs: disk space caching is enabled
[  241.755755] btrfs flagging fs with big metadata feature
[  241.768683] device fsid e7e7f2df-6a4e-45b1-85cc-860cda849953 devid 1 transid 4 /dev/sdd
[  241.769028] btrfs: setting nodatacow
[  241.769030] btrfs: enabling auto defrag
[  241.769031] btrfs: disk space caching is enabled
[  241.769032] btrfs flagging fs with big metadata feature
[  241.781360] device fsid 203fdd4c-baac-49f8-bfdb-08486c937989 devid 1 transid 4 /dev/sde
[  241.781854] btrfs: setting nodatacow
[  241.781859] btrfs: enabling auto defrag
[  241.781861] btrfs: disk space caching is enabled
[  241.781864] btrfs flagging fs with big metadata feature
[  242.713741] device fsid 95c36e12-0098-48d7-a08d-9d54a299206b devid 1 transid 4 /dev/sdf
[  242.714110] btrfs: setting nodatacow
[  242.714118] btrfs: enabling auto defrag
[  242.714121] btrfs: disk space caching is enabled
[  242.714125] btrfs flagging fs with big metadata feature
[  995.725105] couldn't find orphan item for 524
[  995.725126] ------------[ cut here ]------------
[  995.725134] kernel BUG at fs/btrfs/inode.c:2227!
[  995.725143] invalid opcode: 0000 [#1] SMP
[  995.725158] CPU 0
[  995.725162] Modules linked in: btrfs zlib_deflate libcrc32c ext2 coretemp ghash_clmulni_intel aesni_intel bonding cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma enclosure mac_hid lp parport ixgbe usbhid hid isci libsas megaraid_sas scsi_transport_sas igb dca mdio
[  995.725285]
[  995.725290] Pid: 2972, comm: ceph-osd Tainted: G         C 3.4.0-rc7.2012051800+ #14 Supermicro X9SRi/X9SRi
[  995.725324] RIP: 0010:[<ffffffffa028535f>]  [<ffffffffa028535f>] btrfs_orphan_del+0x14f/0x160 [btrfs]
[  995.725354] RSP: 0018:ffff881016ed9d18  EFLAGS: 00010292
[  995.725364] RAX: 0000000000000037 RBX: ffff88101485fdb0 RCX: 00000000ffffffff
[  995.725378] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000246
[  995.725392] RBP: ffff881016ed9d58 R08: 0000000000000000 R09: 0000000000000000
[  995.725405] R10: 0000000000000000 R11: 00000000000000b6 R12: ffff88101efe9f90
[  995.725419] R13: ffff88101efe9c00 R14: 0000000000000001 R15: 0000000000000001
[  995.725433] FS:  00007f58e5dbc700(0000) GS:ffff88107fc00000(0000) knlGS:0000000000000000
[  995.725466] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  995.725492] CR2: 0000000003f28000 CR3: 000000101acac000 CR4: 00000000000407f0
[  995.725522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  995.725551] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  995.725581] Process ceph-osd (pid: 2972, threadinfo ffff881016ed8000, task ffff881016180000)
[  995.725626] Stack:
[  995.725646]  0c00000000000002 ffff88101deaf550 ffff881016ed9d38 ffff88101deaf550
[  995.725700]  0000000000000000 ffff88101efe9c00 ffff88101485fdb0 ffff880be890c1e0
[  995.725757]  ffff881016ed9e08 ffffffffa02897a8 ffff88101485fdb0 0000000000000000
[  995.725807] Call Trace:
[  995.725835]  [<ffffffffa02897a8>] btrfs_truncate+0x5e8/0x6d0 [btrfs]
[  995.725869]  [<ffffffffa028b121>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[  995.725898]  [<ffffffff811955c3>] notify_change+0x183/0x320
[  995.725925]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[  995.725951]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[  995.725979]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[  995.726006] Code: 45 31 ff e9 3c ff ff ff 48 8b b3 58 fe ff ff 48 85 f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 08 48 2e a0 31 c0 e8 09 7c 3c e1 <0f> 0b 48 8b 73 40 eb ea 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
[  995.726221] RIP  [<ffffffffa028535f>] btrfs_orphan_del+0x14f/0x160 [btrfs]
[  995.726258]  RSP <ffff881016ed9d18>
[  995.726574] ---[ end trace 4bde8f513a6d106d ]---



* Re: Ceph on btrfs 3.4rc
  2012-05-18 17:24                                   ` Martin Mailand
@ 2012-05-18 19:01                                     ` Josef Bacik
  2012-05-18 20:11                                       ` Martin Mailand
  2012-05-21  3:59                                       ` Miao Xie
  0 siblings, 2 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-18 19:01 UTC (permalink / raw)
  To: Martin Mailand
  Cc: Josef Bacik, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

On Fri, May 18, 2012 at 07:24:25PM +0200, Martin Mailand wrote:
> Hi Josef,
> there was one line before the bug.
> 
> [  995.725105] couldn't find orphan item for 524
> 
> 

*sigh* ok try this, hopefully it will point me in the right direction.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..492c74f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -156,6 +153,8 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
+	unsigned doing_truncate:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..572da13 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2166,6 +2167,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&root->orphan_lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&root->orphan_lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2195,13 +2199,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
-		delete_item = 1;
+	if (BTRFS_I(inode)->has_orphan_item) {
+		if (trans) {
+			BTRFS_I(inode)->has_orphan_item = 0;
+			delete_item = 1;
+		} else {
+			WARN_ON(1);
+		}
 	}
 
-	if (BTRFS_I(inode)->orphan_meta_reserved) {
+	if (trans && BTRFS_I(inode)->orphan_meta_reserved) {
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
@@ -2209,12 +2221,19 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+		if (ret)
+			printk(KERN_ERR "couldn't find orphan item for %Lu, nlink %d, root %Lu, root being deleted %s\n",
+			       btrfs_ino(inode), inode->i_nlink, root->objectid,
+			       root->orphan_item_inserted ? "yes" : "no");
 		BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
 	}
 
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2341,6 +2360,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 				ret = PTR_ERR(trans);
 				goto out;
 			}
+			printk(KERN_ERR "auto deleting %Lu\n",
+			       found_key.objectid);
 			ret = btrfs_del_orphan_item(trans, root,
 						    found_key.objectid);
 			BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
@@ -2353,7 +2374,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * the proper thing when we hit it
 		 */
 		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+		atomic_inc(&root->orphan_inodes);
+		WARN_ON(BTRFS_I(inode)->has_orphan_item);
+		BTRFS_I(inode)->has_orphan_item = 1;
 		spin_unlock(&root->orphan_lock);
 
 		/* if we have links, this was a truncate, lets do that */
@@ -3671,7 +3694,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6683,9 +6706,13 @@ static int btrfs_truncate(struct inode *inode)
 	u64 mask = root->sectorsize - 1;
 	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
 
+	spin_lock(&BTRFS_I(inode)->lock);
+	BUG_ON(BTRFS_I(inode)->doing_truncate);
+	BTRFS_I(inode)->doing_truncate = 1;
+	spin_unlock(&BTRFS_I(inode)->lock);
 	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
 	if (ret)
-		return ret;
+		goto real_out;
 
 	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
 	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
@@ -6727,8 +6754,10 @@ static int btrfs_truncate(struct inode *inode)
 	 * updating the inode.
 	 */
 	rsv = btrfs_alloc_block_rsv(root);
-	if (!rsv)
-		return -ENOMEM;
+	if (!rsv) {
+		ret = -ENOMEM;
+		goto real_out;
+	}
 	rsv->size = min_size;
 
 	/*
@@ -6847,7 +6876,10 @@ end_trans:
 
 out:
 	btrfs_free_block_rsv(root, rsv);
-
+real_out:
+	spin_lock(&BTRFS_I(inode)->lock);
+	BTRFS_I(inode)->doing_truncate = 0;
+	spin_unlock(&BTRFS_I(inode)->lock);
 	if (ret && !err)
 		err = ret;
 
@@ -6914,6 +6946,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
+	ei->doing_truncate = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6927,7 +6961,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +7005,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-18 19:01                                     ` Josef Bacik
@ 2012-05-18 20:11                                       ` Martin Mailand
  2012-05-21  3:59                                       ` Miao Xie
  1 sibling, 0 replies; 66+ messages in thread
From: Martin Mailand @ 2012-05-18 20:11 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,
now I get
[ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, 
root being deleted no

-martin

Am 18.05.2012 21:01, schrieb Josef Bacik:
> *sigh*  ok try this, hopefully it will point me in the right direction.  Thanks,


[  126.389847] Btrfs loaded
[  126.390284] device fsid 0c9d8c6d-2982-4604-b32a-fc443c4e2c50 devid 1 
transid 4 /dev/sdc
[  126.391246] btrfs: setting nodatacow
[  126.391252] btrfs: enabling auto defrag
[  126.391254] btrfs: disk space caching is enabled
[  126.391257] btrfs flagging fs with big metadata feature
[  126.405700] device fsid e8a0dc27-8714-49bd-a14f-ac37525febb1 devid 1 
transid 4 /dev/sdd
[  126.406162] btrfs: setting nodatacow
[  126.406167] btrfs: enabling auto defrag
[  126.406170] btrfs: disk space caching is enabled
[  126.406172] btrfs flagging fs with big metadata feature
[  126.419819] device fsid f67cd977-ebf4-41f2-9821-f2989e985954 devid 1 
transid 4 /dev/sde
[  126.420198] btrfs: setting nodatacow
[  126.420206] btrfs: enabling auto defrag
[  126.420210] btrfs: disk space caching is enabled
[  126.420214] btrfs flagging fs with big metadata feature
[  127.274555] device fsid 3001355e-c2e2-46c7-9eba-dfecb441d6a6 devid 1 
transid 4 /dev/sdf
[  127.274980] btrfs: setting nodatacow
[  127.274986] btrfs: enabling auto defrag
[  127.274989] btrfs: disk space caching is enabled
[  127.274992] btrfs flagging fs with big metadata feature
[ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, 
root being deleted no
[ 2081.142735] ------------[ cut here ]------------
[ 2081.142750] kernel BUG at fs/btrfs/inode.c:2228!
[ 2081.142766] invalid opcode: 0000 [#1] SMP
[ 2081.142786] CPU 10
[ 2081.142794] Modules linked in: btrfs zlib_deflate libcrc32c ext2 
bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 
microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ioatdma ses 
enclosure mac_hid lp parport usbhid hid megaraid_sas isci libsas 
scsi_transport_sas igb ixgbe dca mdio
[ 2081.142974]
[ 2081.142985] Pid: 2966, comm: ceph-osd Tainted: G         C 
3.4.0-rc7.2012051802+ #16 Supermicro X9SRi/X9SRi
[ 2081.143020] RIP: 0010:[<ffffffffa0269383>]  [<ffffffffa0269383>] 
btrfs_orphan_del+0x173/0x180 [btrfs]
[ 2081.143080] RSP: 0018:ffff881016d83d18  EFLAGS: 00010292
[ 2081.143096] RAX: 0000000000000062 RBX: ffff881017ad4770 RCX: 
00000000ffffffff
[ 2081.143115] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 
0000000000000246
[ 2081.143134] RBP: ffff881016d83d58 R08: 0000000000000000 R09: 
0000000000000000
[ 2081.143154] R10: 0000000000000000 R11: 0000000000000116 R12: 
ffff88101e7baf90
[ 2081.143173] R13: ffff88101e7bac00 R14: 0000000000000001 R15: 
0000000000000001
[ 2081.143193] FS:  00007fcc1e736700(0000) GS:ffff88107fd40000(0000) 
knlGS:0000000000000000
[ 2081.143243] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2081.143274] CR2: 0000000009269000 CR3: 000000101ba87000 CR4: 
00000000000407e0
[ 2081.143308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[ 2081.143341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[ 2081.143376] Process ceph-osd (pid: 2966, threadinfo ffff881016d82000, 
task ffff881023c744a0)
[ 2081.143424] Stack:
[ 2081.143447]  0c00000000000007 ffff88101e1dac30 ffff881016d83d38 
ffff88101e1dac30
[ 2081.143510]  0000000000000000 ffff88101e7bac00 ffff881017ad4770 
ffff88101f0f7d60
[ 2081.143572]  ffff881016d83e08 ffffffffa026d7c8 ffff881017ad4770 
0000000000000000
[ 2081.143634] Call Trace:
[ 2081.143684]  [<ffffffffa026d7c8>] btrfs_truncate+0x5e8/0x6d0 [btrfs]
[ 2081.143737]  [<ffffffffa026f141>] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 2081.143773]  [<ffffffff811955c3>] notify_change+0x183/0x320
[ 2081.143807]  [<ffffffff8117889e>] do_truncate+0x5e/0xa0
[ 2081.143839]  [<ffffffff81178a24>] sys_truncate+0x144/0x1b0
[ 2081.143873]  [<ffffffff8165fd29>] system_call_fastpath+0x16/0x1b
[ 2081.143903] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 44 c0 48 85 
f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 10 88 2c a0 31 c0 e8 e5 3b 
3e e1 <0f> 0b 48 8b 73 40 eb ea 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10
[ 2081.144199] RIP  [<ffffffffa0269383>] btrfs_orphan_del+0x173/0x180 
[btrfs]
[ 2081.144258]  RSP <ffff881016d83d18>
[ 2081.144614] ---[ end trace 8d0829d100639242 ]---


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-18 19:01                                     ` Josef Bacik
  2012-05-18 20:11                                       ` Martin Mailand
@ 2012-05-21  3:59                                       ` Miao Xie
  2012-05-22 10:29                                           ` Christian Brunner
  2012-05-22 13:31                                         ` Josef Bacik
  1 sibling, 2 replies; 66+ messages in thread
From: Miao Xie @ 2012-05-21  3:59 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Martin Mailand, Christian Brunner, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

On Fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 9b9b15f..492c74f 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -57,9 +57,6 @@ struct btrfs_inode {
>  	/* used to order data wrt metadata */
>  	struct btrfs_ordered_inode_tree ordered_tree;
>  
> -	/* for keeping track of orphaned inodes */
> -	struct list_head i_orphan;
> -
>  	/* list of all the delalloc inodes in the FS.  There are times we need
>  	 * to write all the delalloc pages to disk, and this list is used
>  	 * to walk them all.
> @@ -156,6 +153,8 @@ struct btrfs_inode {
>  	unsigned dummy_inode:1;
>  	unsigned in_defrag:1;
>  	unsigned delalloc_meta_reserved:1;
> +	unsigned has_orphan_item:1;
> +	unsigned doing_truncate:1;

I think the problem is that we should not use different locks to protect bit fields that
are stored in the same machine word; otherwise one writer's read-modify-write of the word
can clobber bits changed by another. Could you try declaring ->delalloc_meta_reserved and
->has_orphan_item as integers?

Thanks
Miao

>  
>  	/*
>  	 * always compress this one file
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 8fd7233..aad2600 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1375,7 +1375,7 @@ struct btrfs_root {
>  	struct list_head root_list;
>  
>  	spinlock_t orphan_lock;
> -	struct list_head orphan_list;
> +	atomic_t orphan_inodes;
>  	struct btrfs_block_rsv *orphan_block_rsv;
>  	int orphan_item_inserted;
>  	int orphan_cleanup_state;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a7ffc88..ff3bf4b 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>  	root->orphan_block_rsv = NULL;
>  
>  	INIT_LIST_HEAD(&root->dirty_list);
> -	INIT_LIST_HEAD(&root->orphan_list);
>  	INIT_LIST_HEAD(&root->root_list);
>  	spin_lock_init(&root->orphan_lock);
>  	spin_lock_init(&root->inode_lock);
> @@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>  	atomic_set(&root->log_commit[0], 0);
>  	atomic_set(&root->log_commit[1], 0);
>  	atomic_set(&root->log_writers, 0);
> +	atomic_set(&root->orphan_inodes, 0);
>  	root->log_batch = 0;
>  	root->log_transid = 0;
>  	root->last_log_commit = 0;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 61b16c6..572da13 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
>  	struct btrfs_block_rsv *block_rsv;
>  	int ret;
>  
> -	if (!list_empty(&root->orphan_list) ||
> +	if (atomic_read(&root->orphan_inodes) ||
>  	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
>  		return;
>  
>  	spin_lock(&root->orphan_lock);
> -	if (!list_empty(&root->orphan_list)) {
> +	if (atomic_read(&root->orphan_inodes)) {
>  		spin_unlock(&root->orphan_lock);
>  		return;
>  	}
> @@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
>  		block_rsv = NULL;
>  	}
>  
> -	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
> -		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
> +	if (!BTRFS_I(inode)->has_orphan_item) {
> +		BTRFS_I(inode)->has_orphan_item = 1;
>  #if 0
>  		/*
>  		 * For proper ENOSPC handling, we should do orphan
> @@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
>  			insert = 1;
>  #endif
>  		insert = 1;
> +		atomic_inc(&root->orphan_inodes);
>  	}
>  
>  	if (!BTRFS_I(inode)->orphan_meta_reserved) {
> @@ -2166,6 +2167,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
>  	if (insert >= 1) {
>  		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
>  		if (ret && ret != -EEXIST) {
> +			spin_lock(&root->orphan_lock);
> +			BTRFS_I(inode)->has_orphan_item = 0;
> +			spin_unlock(&root->orphan_lock);
>  			btrfs_abort_transaction(trans, root, ret);
>  			return ret;
>  		}
> @@ -2195,13 +2199,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
>  	int release_rsv = 0;
>  	int ret = 0;
>  
> +	/*
> +	 * evict_inode gets called without holding the i_mutex so we need to
> +	 * take the orphan lock to make sure we are safe in messing with these.
> +	 */
>  	spin_lock(&root->orphan_lock);
> -	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
> -		list_del_init(&BTRFS_I(inode)->i_orphan);
> -		delete_item = 1;
> +	if (BTRFS_I(inode)->has_orphan_item) {
> +		if (trans) {
> +			BTRFS_I(inode)->has_orphan_item = 0;
> +			delete_item = 1;
> +		} else {
> +			WARN_ON(1);
> +		}
>  	}
>  
> -	if (BTRFS_I(inode)->orphan_meta_reserved) {
> +	if (trans && BTRFS_I(inode)->orphan_meta_reserved) {
>  		BTRFS_I(inode)->orphan_meta_reserved = 0;
>  		release_rsv = 1;
>  	}
> @@ -2209,12 +2221,19 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
>  
>  	if (trans && delete_item) {
>  		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
> +		if (ret)
> +			printk(KERN_ERR "couldn't find orphan item for %Lu, nlink %d, root %Lu, root being deleted %s\n",
> +			       btrfs_ino(inode), inode->i_nlink, root->objectid,
> +			       root->orphan_item_inserted ? "yes" : "no");
>  		BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
>  	}
>  
>  	if (release_rsv)
>  		btrfs_orphan_release_metadata(inode);
>  
> +	if (trans && delete_item)
> +		atomic_dec(&root->orphan_inodes);
> +
>  	return 0;
>  }
>  
> @@ -2341,6 +2360,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
>  				ret = PTR_ERR(trans);
>  				goto out;
>  			}
> +			printk(KERN_ERR "auto deleting %Lu\n",
> +			       found_key.objectid);
>  			ret = btrfs_del_orphan_item(trans, root,
>  						    found_key.objectid);
>  			BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
> @@ -2353,7 +2374,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
>  		 * the proper thing when we hit it
>  		 */
>  		spin_lock(&root->orphan_lock);
> -		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
> +		atomic_inc(&root->orphan_inodes);
> +		WARN_ON(BTRFS_I(inode)->has_orphan_item);
> +		BTRFS_I(inode)->has_orphan_item = 1;
>  		spin_unlock(&root->orphan_lock);
>  
>  		/* if we have links, this was a truncate, lets do that */
> @@ -3671,7 +3694,7 @@ void btrfs_evict_inode(struct inode *inode)
>  	btrfs_wait_ordered_range(inode, 0, (u64)-1);
>  
>  	if (root->fs_info->log_root_recovering) {
> -		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
> +		BUG_ON(BTRFS_I(inode)->has_orphan_item);
>  		goto no_delete;
>  	}
>  
> @@ -6683,9 +6706,13 @@ static int btrfs_truncate(struct inode *inode)
>  	u64 mask = root->sectorsize - 1;
>  	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
>  
> +	spin_lock(&BTRFS_I(inode)->lock);
> +	BUG_ON(BTRFS_I(inode)->doing_truncate);
> +	BTRFS_I(inode)->doing_truncate = 1;
> +	spin_unlock(&BTRFS_I(inode)->lock);
>  	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
>  	if (ret)
> -		return ret;
> +		goto real_out;
>  
>  	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
>  	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
> @@ -6727,8 +6754,10 @@ static int btrfs_truncate(struct inode *inode)
>  	 * updating the inode.
>  	 */
>  	rsv = btrfs_alloc_block_rsv(root);
> -	if (!rsv)
> -		return -ENOMEM;
> +	if (!rsv) {
> +		ret = -ENOMEM;
> +		goto real_out;
> +	}
>  	rsv->size = min_size;
>  
>  	/*
> @@ -6847,7 +6876,10 @@ end_trans:
>  
>  out:
>  	btrfs_free_block_rsv(root, rsv);
> -
> +real_out:
> +	spin_lock(&BTRFS_I(inode)->lock);
> +	BTRFS_I(inode)->doing_truncate = 0;
> +	spin_unlock(&BTRFS_I(inode)->lock);
>  	if (ret && !err)
>  		err = ret;
>  
> @@ -6914,6 +6946,8 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>  	ei->dummy_inode = 0;
>  	ei->in_defrag = 0;
>  	ei->delalloc_meta_reserved = 0;
> +	ei->has_orphan_item = 0;
> +	ei->doing_truncate = 0;
>  	ei->force_compress = BTRFS_COMPRESS_NONE;
>  
>  	ei->delayed_node = NULL;
> @@ -6927,7 +6961,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>  	mutex_init(&ei->log_mutex);
>  	mutex_init(&ei->delalloc_mutex);
>  	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
> -	INIT_LIST_HEAD(&ei->i_orphan);
>  	INIT_LIST_HEAD(&ei->delalloc_inodes);
>  	INIT_LIST_HEAD(&ei->ordered_operations);
>  	RB_CLEAR_NODE(&ei->rb_node);
> @@ -6972,13 +7005,11 @@ void btrfs_destroy_inode(struct inode *inode)
>  		spin_unlock(&root->fs_info->ordered_extent_lock);
>  	}
>  
> -	spin_lock(&root->orphan_lock);
> -	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
> +	if (BTRFS_I(inode)->has_orphan_item) {
>  		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
>  		       (unsigned long long)btrfs_ino(inode));
> -		list_del_init(&BTRFS_I(inode)->i_orphan);
> +		atomic_dec(&root->orphan_inodes);
>  	}
> -	spin_unlock(&root->orphan_lock);
>  
>  	while (1) {
>  		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-21  3:59                                       ` Miao Xie
@ 2012-05-22 10:29                                           ` Christian Brunner
  2012-05-22 13:31                                         ` Josef Bacik
  1 sibling, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-05-22 10:29 UTC (permalink / raw)
  To: miaox; +Cc: Josef Bacik, Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

2012/5/21 Miao Xie <miaox@cn.fujitsu.com>:
> Hi Josef,
>
> On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 9b9b15f..492c74f 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -57,9 +57,6 @@ struct btrfs_inode {
>>       /* used to order data wrt metadata */
>>       struct btrfs_ordered_inode_tree ordered_tree;
>>
>> -     /* for keeping track of orphaned inodes */
>> -     struct list_head i_orphan;
>> -
>>       /* list of all the delalloc inodes in the FS.  There are times we need
>>        * to write all the delalloc pages to disk, and this list is used
>>        * to walk them all.
>> @@ -156,6 +153,8 @@ struct btrfs_inode {
>>       unsigned dummy_inode:1;
>>       unsigned in_defrag:1;
>>       unsigned delalloc_meta_reserved:1;
>> +     unsigned has_orphan_item:1;
>> +     unsigned doing_truncate:1;
>
> I think the problem is that we should not use different locks to protect bit fields that
> are stored in the same machine word; otherwise one writer's read-modify-write of the word
> can clobber bits changed by another. Could you try declaring ->delalloc_meta_reserved and
> ->has_orphan_item as integers?

I have tried changing it to:

struct btrfs_inode {
        unsigned orphan_meta_reserved:1;
        unsigned dummy_inode:1;
        unsigned in_defrag:1;
-       unsigned delalloc_meta_reserved:1;
+       int delalloc_meta_reserved;
+       int has_orphan_item;
+       int doing_truncate;

The strange thing is that I'm no longer hitting the BUG_ON, only the
old WARNING (no additional messages):

[351021.157124] ------------[ cut here ]------------
[351021.162400] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351021.171812] Hardware name: ProLiant DL180 G6
[351021.176867] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351021.200236] Pid: 9837, comm: btrfs-transacti Tainted: P        W
O 3.3.5-1.fits.1.el6.x86_64 #1
[351021.210126] Call Trace:
[351021.212957]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
[351021.219758]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
[351021.226385]  [<ffffffffa03eb627>]
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351021.234461]  [<ffffffffa03e6976>] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351021.241669]  [<ffffffffa0438c61>] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[351021.249841]  [<ffffffffa03e7ae4>]
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351021.258006]  [<ffffffffa03e8432>] ? start_transaction+0x92/0x310 [btrfs]
[351021.265580]  [<ffffffff81070aa0>] ? wake_up_bit+0x40/0x40
[351021.271719]  [<ffffffffa03e2f3b>] transaction_kthread+0x26b/0x2e0 [btrfs]
[351021.279405]  [<ffffffffa03e2cd0>] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.288934]  [<ffffffffa03e2cd0>] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.298449]  [<ffffffff8107040e>] kthread+0x9e/0xb0
[351021.303989]  [<ffffffff8158c5a4>] kernel_thread_helper+0x4/0x10
[351021.310691]  [<ffffffff81070370>] ? kthread_freezable_should_stop+0x70/0x70
[351021.318555]  [<ffffffff8158c5a0>] ? gs_change+0x13/0x13
[351021.324479] ---[ end trace 9adc7b36a3e66833 ]---
[351710.339482] ------------[ cut here ]------------
[351710.344754] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351710.354165] Hardware name: ProLiant DL180 G6
[351710.359222] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351710.382569] Pid: 9797, comm: kworker/5:0 Tainted: P        W  O
3.3.5-1.fits.1.el6.x86_64 #1
[351710.392075] Call Trace:
[351710.394901]  [<ffffffff8104df6f>] warn_slowpath_common+0x7f/0xc0
[351710.401750]  [<ffffffff8104dfca>] warn_slowpath_null+0x1a/0x20
[351710.408414]  [<ffffffffa03eb627>]
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351710.416528]  [<ffffffffa03e6976>] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351710.423775]  [<ffffffffa03e7ae4>]
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351710.431983]  [<ffffffff810127a3>] ? __switch_to+0x153/0x440
[351710.438352]  [<ffffffff81070aa0>] ? wake_up_bit+0x40/0x40
[351710.444529]  [<ffffffffa03e7fb0>] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[351710.452894]  [<ffffffffa03e7fcf>] do_async_commit+0x1f/0x30 [btrfs]
[351710.459979]  [<ffffffff81068959>] process_one_work+0x129/0x450
[351710.466576]  [<ffffffff8106b7fb>] worker_thread+0x17b/0x3c0
[351710.472884]  [<ffffffff8106b680>] ? manage_workers+0x220/0x220
[351710.479472]  [<ffffffff8107040e>] kthread+0x9e/0xb0
[351710.485029]  [<ffffffff8158c5a4>] kernel_thread_helper+0x4/0x10
[351710.491731]  [<ffffffff81070370>] ? kthread_freezable_should_stop+0x70/0x70
[351710.499640]  [<ffffffff8158c5a0>] ? gs_change+0x13/0x13
[351710.505590] ---[ end trace 9adc7b36a3e66834 ]---


Regards,
Christian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-21  3:59                                       ` Miao Xie
  2012-05-22 10:29                                           ` Christian Brunner
@ 2012-05-22 13:31                                         ` Josef Bacik
  1 sibling, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-22 13:31 UTC (permalink / raw)
  To: Miao Xie
  Cc: Josef Bacik, Martin Mailand, Christian Brunner, Sage Weil,
	linux-btrfs, ceph-devel

On Mon, May 21, 2012 at 11:59:54AM +0800, Miao Xie wrote:
> Hi Josef,
> 
> On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index 9b9b15f..492c74f 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -57,9 +57,6 @@ struct btrfs_inode {
> >  	/* used to order data wrt metadata */
> >  	struct btrfs_ordered_inode_tree ordered_tree;
> >  
> > -	/* for keeping track of orphaned inodes */
> > -	struct list_head i_orphan;
> > -
> >  	/* list of all the delalloc inodes in the FS.  There are times we need
> >  	 * to write all the delalloc pages to disk, and this list is used
> >  	 * to walk them all.
> > @@ -156,6 +153,8 @@ struct btrfs_inode {
> >  	unsigned dummy_inode:1;
> >  	unsigned in_defrag:1;
> >  	unsigned delalloc_meta_reserved:1;
> > +	unsigned has_orphan_item:1;
> > +	unsigned doing_truncate:1;
> 
> I think the problem is that we should not use different locks to protect bit fields that
> are stored in the same machine word; otherwise one writer's read-modify-write of the word
> can clobber bits changed by another. Could you try declaring ->delalloc_meta_reserved and
> ->has_orphan_item as integers?
> 

Oh freaking duh, thank you Miao, I'm an idiot.

Josef

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-22 10:29                                           ` Christian Brunner
@ 2012-05-22 17:33                                             ` Josef Bacik
  -1 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-22 17:33 UTC (permalink / raw)
  To: Christian Brunner
  Cc: miaox, Josef Bacik, Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

On Tue, May 22, 2012 at 12:29:59PM +0200, Christian Brunner wrote:
> 2012/5/21 Miao Xie <miaox@cn.fujitsu.com>:
> > Hi Josef,
> >
> > On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> >> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> >> index 9b9b15f..492c74f 100644
> >> --- a/fs/btrfs/btrfs_inode.h
> >> +++ b/fs/btrfs/btrfs_inode.h
> >> @@ -57,9 +57,6 @@ struct btrfs_inode {
> >>       /* used to order data wrt metadata */
> >>       struct btrfs_ordered_inode_tree ordered_tree;
> >>
> >> -     /* for keeping track of orphaned inodes */
> >> -     struct list_head i_orphan;
> >> -
> >>       /* list of all the delalloc inodes in the FS.  There are times we need
> >>        * to write all the delalloc pages to disk, and this list is used
> >>        * to walk them all.
> >> @@ -156,6 +153,8 @@ struct btrfs_inode {
> >>       unsigned dummy_inode:1;
> >>       unsigned in_defrag:1;
> >>       unsigned delalloc_meta_reserved:1;
> >> +     unsigned has_orphan_item:1;
> >> +     unsigned doing_truncate:1;
> >
> > I think the problem is that we should not use different locks to protect bit fields
> > stored in the same machine word. Otherwise some bit fields may be clobbered by others
> > when someone changes those fields. Could you try declaring ->delalloc_meta_reserved
> > and ->has_orphan_item as integers?
> 
> I have tried changing it to:
> 
> struct btrfs_inode {
>         unsigned orphan_meta_reserved:1;
>         unsigned dummy_inode:1;
>         unsigned in_defrag:1;
> -       unsigned delalloc_meta_reserved:1;
> +       int delalloc_meta_reserved;
> +       int has_orphan_item;
> +       int doing_truncate;
> 
> The strange thing is, that I'm no longer hitting the BUG_ON, but the
> old WARNING (no additional messages):
> 

Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
taking the BTRFS_I(inode)->lock when messing with these since we don't want to
take up all that space in the inode just for a marker.  I ran this patch for 3
hours with no issues, let me know if it works for you.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..54f1b30 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2166,8 +2166,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	spin_lock(&BTRFS_I(inode)->lock);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2180,12 +2181,14 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
 		BTRFS_I(inode)->orphan_meta_reserved = 1;
 		reserve = 1;
 	}
+	spin_unlock(&BTRFS_I(inode)->lock);
 	spin_unlock(&root->orphan_lock);
 
 	/* grab metadata reservation from transaction handle */
@@ -2198,6 +2201,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&BTRFS_I(inode)->lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&BTRFS_I(inode)->lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2227,26 +2233,41 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
-		delete_item = 1;
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
+	spin_lock(&BTRFS_I(inode)->lock);
+	if (BTRFS_I(inode)->has_orphan_item) {
+		if (trans) {
+			BTRFS_I(inode)->has_orphan_item = 0;
+			delete_item = 1;
+		} else {
+			WARN_ON(1);
+		}
 	}
 
-	if (BTRFS_I(inode)->orphan_meta_reserved) {
+	if (trans && BTRFS_I(inode)->orphan_meta_reserved) {
 		BTRFS_I(inode)->orphan_meta_reserved = 0;
 		release_rsv = 1;
 	}
-	spin_unlock(&root->orphan_lock);
+	spin_unlock(&BTRFS_I(inode)->lock);
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+		if (ret)
+			printk(KERN_ERR "couldn't find orphan item for %Lu, nlink %d, root %Lu, root being deleted %s\n",
+			       btrfs_ino(inode), inode->i_nlink, root->objectid,
+			       root->orphan_item_inserted ? "yes" : "no");
 		BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
 	}
 
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2373,6 +2394,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 				ret = PTR_ERR(trans);
 				goto out;
 			}
+			printk(KERN_ERR "auto deleting %Lu\n",
+			       found_key.objectid);
 			ret = btrfs_del_orphan_item(trans, root,
 						    found_key.objectid);
 			BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
@@ -2384,9 +2407,11 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * add this inode to the orphan list so btrfs_orphan_del does
 		 * the proper thing when we hit it
 		 */
-		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
-		spin_unlock(&root->orphan_lock);
+		spin_lock(&BTRFS_I(inode)->lock);
+		atomic_inc(&root->orphan_inodes);
+		WARN_ON(BTRFS_I(inode)->has_orphan_item);
+		BTRFS_I(inode)->has_orphan_item = 1;
+		spin_unlock(&BTRFS_I(inode)->lock);
 
 		/* if we have links, this was a truncate, lets do that */
 		if (inode->i_nlink) {
@@ -3707,7 +3732,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!BTRFS_I(inode)->has_orphan_item);
 		goto no_delete;
 	}
 
@@ -6638,7 +6663,7 @@ static int btrfs_truncate(struct inode *inode)
 
 	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
 	if (ret)
-		return ret;
+		goto real_out;
 
 	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
 	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
@@ -6680,8 +6705,10 @@ static int btrfs_truncate(struct inode *inode)
 	 * updating the inode.
 	 */
 	rsv = btrfs_alloc_block_rsv(root);
-	if (!rsv)
-		return -ENOMEM;
+	if (!rsv) {
+		ret = -ENOMEM;
+		goto real_out;
+	}
 	rsv->size = min_size;
 
 	/*
@@ -6800,7 +6827,7 @@ end_trans:
 
 out:
 	btrfs_free_block_rsv(root, rsv);
-
+real_out:
 	if (ret && !err)
 		err = ret;
 
@@ -6866,6 +6893,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->dummy_inode = 0;
 	ei->in_defrag = 0;
 	ei->delalloc_meta_reserved = 0;
+	ei->has_orphan_item = 0;
 	ei->force_compress = BTRFS_COMPRESS_NONE;
 
 	ei->delayed_node = NULL;
@@ -6879,7 +6907,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6924,13 +6951,11 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (BTRFS_I(inode)->has_orphan_item) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-22 17:33                                             ` Josef Bacik
@ 2012-05-23 12:34                                               ` Christian Brunner
  -1 siblings, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-05-23 12:34 UTC (permalink / raw)
  To: Josef Bacik; +Cc: miaox, Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

2012/5/22 Josef Bacik <josef@redhat.com>:
>>
>
> Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
> taking the BTRFS_I(inode)->lock when messing with these since we don't want to
> take up all that space in the inode just for a marker.  I ran this patch for 3
> hours with no issues, let me know if it works for you.  Thanks,

Compared to the last runs, I had to run it much longer, but somehow I
managed to hit a BUG_ON again:

[448281.002087] couldn't find orphan item for 2027, nlink 1, root 308,
root being deleted no
[448281.011339] ------------[ cut here ]------------
[448281.016590] kernel BUG at fs/btrfs/inode.c:2230!
[448281.021837] invalid opcode: 0000 [#1] SMP
[448281.026525] CPU 4
[448281.028670] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[448281.052215]
[448281.053977] Pid: 16018, comm: ceph-osd Tainted: P        W  O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[448281.065555] RIP: 0010:[<ffffffffa04a17ab>]  [<ffffffffa04a17ab>]
btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.075965] RSP: 0018:ffff880458257d18  EFLAGS: 00010292
[448281.081987] RAX: 0000000000000063 RBX: ffff8803a28ebc48 RCX:
0000000000002fdb
[448281.090042] RDX: 0000000000000000 RSI: 0000000000000046 RDI:
0000000000000246
[448281.098093] RBP: ffff880458257d58 R08: ffffffff81af6100 R09:
0000000000000000
[448281.106146] R10: 0000000000000004 R11: 0000000000000000 R12:
0000000000000001
[448281.114202] R13: ffff88052e130400 R14: 0000000000000001 R15:
ffff8805beae9e10
[448281.122262] FS:  00007fa2e772f700(0000) GS:ffff880627280000(0000)
knlGS:0000000000000000
[448281.131386] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[448281.137879] CR2: ffffffffff600400 CR3: 00000005015a5000 CR4:
00000000000006e0
[448281.145929] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[448281.153974] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[448281.162043] Process ceph-osd (pid: 16018, threadinfo
ffff880458256000, task ffff88055b711940)
[448281.171646] Stack:
[448281.173987]  ffff880458257dff ffff8803a28eba98 ffff880458257d58
ffff8805beae9e10
[448281.182377]  0000000000000000 ffff88052e130400 ffff88029ff33380
ffff8803a28ebc48
[448281.190766]  ffff880458257e08 ffffffffa04ab4e6 0000000000000000
ffff8803a28ebc48
[448281.199155] Call Trace:
[448281.202005]  [<ffffffffa04ab4e6>] btrfs_truncate+0x5f6/0x660 [btrfs]
[448281.209203]  [<ffffffffa04ab646>] btrfs_setattr+0xf6/0x1a0 [btrfs]
[448281.216202]  [<ffffffff811816fb>] notify_change+0x18b/0x2b0
[448281.222517]  [<ffffffff81276541>] ? selinux_inode_permission+0xd1/0x130
[448281.229990]  [<ffffffff81165f44>] do_truncate+0x64/0xa0
[448281.235919]  [<ffffffff81172669>] ? inode_permission+0x49/0x100
[448281.242617]  [<ffffffff81166197>] sys_truncate+0x137/0x150
[448281.248838]  [<ffffffff8158b1e9>] system_call_fastpath+0x16/0x1b
[448281.255631] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 45 c0 48
85 f6 74 1b 80 bb 60 fe ff ff 84 74 12 48 c7 c7 e8 1d 50 a0 31 c0 e8
9d ea 0d e1 <0f> 0b eb fe 48 8b 73 40 eb e8 66 66 2e 0f 1f 84 00 00 00
00 00
[448281.277435] RIP  [<ffffffffa04a17ab>] btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.285229]  RSP <ffff880458257d18>
[448281.289667] ---[ end trace 9adc7b36a3e66872 ]---

Sorry,
Christian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-23 12:34                                               ` Christian Brunner
@ 2012-05-23 14:12                                                 ` Josef Bacik
  -1 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-23 14:12 UTC (permalink / raw)
  To: Christian Brunner
  Cc: Josef Bacik, miaox, Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

On Wed, May 23, 2012 at 02:34:43PM +0200, Christian Brunner wrote:
> 2012/5/22 Josef Bacik <josef@redhat.com>:
> >>
> >
> > Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
> > taking the BTRFS_I(inode)->lock when messing with these since we don't want to
> > take up all that space in the inode just for a marker.  I ran this patch for 3
> > hours with no issues, let me know if it works for you.  Thanks,
> 
> Compared to the last runs, I had to run it much longer, but somehow I
> managed to hit a BUG_ON again:
> 

Yeah, it's because we access other parts of that bitfield with no lock at all, which is
likely what's screwing us.  I'm going to have to redo that part and then do the orphan
fix; I'll have a patch shortly.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Ceph on btrfs 3.4rc
  2012-05-23 12:34                                               ` Christian Brunner
@ 2012-05-23 15:02                                                 ` Josef Bacik
  -1 siblings, 0 replies; 66+ messages in thread
From: Josef Bacik @ 2012-05-23 15:02 UTC (permalink / raw)
  To: Christian Brunner
  Cc: Josef Bacik, miaox, Martin Mailand, Sage Weil, linux-btrfs, ceph-devel

On Wed, May 23, 2012 at 02:34:43PM +0200, Christian Brunner wrote:
> 2012/5/22 Josef Bacik <josef@redhat.com>:
> >>
> >
> > Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
> > taking the BTRFS_I(inode)->lock when messing with these since we don't want to
> > take up all that space in the inode just for a marker.  I ran this patch for 3
> > hours with no issues, let me know if it works for you.  Thanks,
> 
> Compared to the last runs, I had to run it much longer, but somehow I
> managed to hit a BUG_ON again:
> 

Ok give this a shot, it should do it.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..41ddec8 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -24,6 +24,22 @@
 #include "ordered-data.h"
 #include "delayed-inode.h"
 
+/*
+ * ordered_data_close is set by truncate when a file that used
+ * to have good data has been truncated to zero.  When it is set
+ * the btrfs file release call will add this inode to the
+ * ordered operations list so that we make sure to flush out any
+ * new data the application may have written before commit.
+ */
+#define BTRFS_INODE_ORDERED_DATA_CLOSE		0
+#define BTRFS_INODE_ORPHAN_META_RESERVED	1
+#define BTRFS_INODE_DUMMY			2
+#define BTRFS_INODE_IN_DEFRAG			3
+#define BTRFS_INODE_DELALLOC_META_RESERVED	4
+#define BTRFS_INODE_HAS_ORPHAN_ITEM		5
+#define BTRFS_INODE_FORCE_ZLIB			6
+#define BTRFS_INODE_FORCE_LZO			7
+
 /* in memory btrfs inode */
 struct btrfs_inode {
 	/* which subvolume this inode belongs to */
@@ -57,9 +73,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -143,24 +156,7 @@ struct btrfs_inode {
 	 */
 	unsigned outstanding_extents;
 	unsigned reserved_extents;
-
-	/*
-	 * ordered_data_close is set by truncate when a file that used
-	 * to have good data has been truncated to zero.  When it is set
-	 * the btrfs file release call will add this inode to the
-	 * ordered operations list so that we make sure to flush out any
-	 * new data the application may have written before commit.
-	 */
-	unsigned ordered_data_close:1;
-	unsigned orphan_meta_reserved:1;
-	unsigned dummy_inode:1;
-	unsigned in_defrag:1;
-	unsigned delalloc_meta_reserved:1;
-
-	/*
-	 * always compress this one file
-	 */
-	unsigned force_compress:4;
+	unsigned long runtime_flags;
 
 	struct btrfs_delayed_node *delayed_node;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 03e3748..5190861 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -669,8 +669,8 @@ static int btrfs_delayed_inode_reserve_metadata(
 		return ret;
 	} else if (src_rsv == &root->fs_info->delalloc_block_rsv) {
 		spin_lock(&BTRFS_I(inode)->lock);
-		if (BTRFS_I(inode)->delalloc_meta_reserved) {
-			BTRFS_I(inode)->delalloc_meta_reserved = 0;
+		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+				       &BTRFS_I(inode)->runtime_flags)) {
 			spin_unlock(&BTRFS_I(inode)->lock);
 			release = true;
 			goto migrate;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..0ddeb0d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
@@ -2001,7 +2001,8 @@ int open_ctree(struct super_block *sb,
 	BTRFS_I(fs_info->btree_inode)->root = tree_root;
 	memset(&BTRFS_I(fs_info->btree_inode)->location, 0,
 	       sizeof(struct btrfs_key));
-	BTRFS_I(fs_info->btree_inode)->dummy_inode = 1;
+	set_bit(BTRFS_INODE_DUMMY,
+		&BTRFS_I(fs_info->btree_inode)->runtime_flags);
 	insert_inode_hash(fs_info->btree_inode);
 
 	spin_lock_init(&fs_info->block_group_cache_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 49fd7b6..b372040 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4355,10 +4355,9 @@ static unsigned drop_outstanding_extent(struct inode *inode)
 	BTRFS_I(inode)->outstanding_extents--;
 
 	if (BTRFS_I(inode)->outstanding_extents == 0 &&
-	    BTRFS_I(inode)->delalloc_meta_reserved) {
+	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+			       &BTRFS_I(inode)->runtime_flags))
 		drop_inode_space = 1;
-		BTRFS_I(inode)->delalloc_meta_reserved = 0;
-	}
 
 	/*
 	 * If we have more or the same amount of outstanding extents than we have
@@ -4465,7 +4464,8 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	 * Add an item to reserve for updating the inode when we complete the
 	 * delalloc io.
 	 */
-	if (!BTRFS_I(inode)->delalloc_meta_reserved) {
+	if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+		      &BTRFS_I(inode)->runtime_flags)) {
 		nr_extents++;
 		extra_reserve = 1;
 	}
@@ -4511,7 +4511,8 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	if (extra_reserve) {
-		BTRFS_I(inode)->delalloc_meta_reserved = 1;
+		set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+			&BTRFS_I(inode)->runtime_flags);
 		nr_extents--;
 	}
 	BTRFS_I(inode)->reserved_extents += nr_extents;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 53bf2d7..2f19fe9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -103,7 +103,7 @@ static void __btrfs_add_inode_defrag(struct inode *inode,
 			goto exists;
 		}
 	}
-	BTRFS_I(inode)->in_defrag = 1;
+	set_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
 	rb_link_node(&defrag->rb_node, parent, p);
 	rb_insert_color(&defrag->rb_node, &root->fs_info->defrag_inodes);
 	return;
@@ -131,7 +131,7 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	if (btrfs_fs_closing(root->fs_info))
 		return 0;
 
-	if (BTRFS_I(inode)->in_defrag)
+	if (test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags))
 		return 0;
 
 	if (trans)
@@ -148,7 +148,7 @@ int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
 	defrag->root = root->root_key.objectid;
 
 	spin_lock(&root->fs_info->defrag_inodes_lock);
-	if (!BTRFS_I(inode)->in_defrag)
+	if (!test_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags))
 		__btrfs_add_inode_defrag(inode, defrag);
 	else
 		kfree(defrag);
@@ -252,7 +252,7 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 			goto next;
 
 		/* do a chunk of defrag */
-		BTRFS_I(inode)->in_defrag = 0;
+		clear_bit(BTRFS_INODE_IN_DEFRAG, &BTRFS_I(inode)->runtime_flags);
 		range.start = defrag->last_offset;
 		num_defrag = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
 					       defrag_batch);
@@ -1466,8 +1466,8 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
 	 * flush down new bytes that may have been written if the
 	 * application were using truncate to replace a file in place.
 	 */
-	if (BTRFS_I(inode)->ordered_data_close) {
-		BTRFS_I(inode)->ordered_data_close = 0;
+	if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+			       &BTRFS_I(inode)->runtime_flags)) {
 		btrfs_add_ordered_operation(NULL, BTRFS_I(inode)->root, inode);
 		if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
 			filemap_flush(inode->i_mapping);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..1d42dba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -109,6 +109,15 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans,
 	return err;
 }
 
+static int btrfs_inode_force_compress(struct inode *inode)
+{
+	if (test_bit(BTRFS_INODE_FORCE_ZLIB, &BTRFS_I(inode)->runtime_flags))
+		return BTRFS_COMPRESS_ZLIB;
+	if (test_bit(BTRFS_INODE_FORCE_LZO, &BTRFS_I(inode)->runtime_flags))
+		return BTRFS_COMPRESS_LZO;
+	return BTRFS_COMPRESS_NONE;
+}
+
 /*
  * this does all the hard work for inserting an inline extent into
  * the btree.  The caller should have done a btrfs_drop_extents so that
@@ -396,7 +405,7 @@ again:
 	 */
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
 	    (btrfs_test_opt(root, COMPRESS) ||
-	     (BTRFS_I(inode)->force_compress) ||
+	     btrfs_inode_force_compress(inode) ||
 	     (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
 		WARN_ON(pages);
 		pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
@@ -405,8 +414,8 @@ again:
 			goto cont;
 		}
 
-		if (BTRFS_I(inode)->force_compress)
-			compress_type = BTRFS_I(inode)->force_compress;
+		if (btrfs_inode_force_compress(inode))
+			compress_type = btrfs_inode_force_compress(inode);
 
 		ret = btrfs_compress_pages(compress_type,
 					   inode->i_mapping, start,
@@ -514,7 +523,7 @@ cont:
 
 		/* flag the file so we don't compress in the future */
 		if (!btrfs_test_opt(root, FORCE_COMPRESS) &&
-		    !(BTRFS_I(inode)->force_compress)) {
+		    !btrfs_inode_force_compress(inode)) {
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NOCOMPRESS;
 		}
 	}
@@ -1365,7 +1374,7 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
 	else if (!btrfs_test_opt(root, COMPRESS) &&
-		 !(BTRFS_I(inode)->force_compress) &&
+		 !btrfs_inode_force_compress(inode) &&
 		 !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1);
@@ -2072,12 +2081,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2134,8 +2143,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!test_and_set_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+			      &BTRFS_I(inode)->runtime_flags)) {
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2148,12 +2157,12 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
-	if (!BTRFS_I(inode)->orphan_meta_reserved) {
-		BTRFS_I(inode)->orphan_meta_reserved = 1;
+	if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
+			      &BTRFS_I(inode)->runtime_flags))
 		reserve = 1;
-	}
 	spin_unlock(&root->orphan_lock);
 
 	/* grab metadata reservation from transaction handle */
@@ -2166,6 +2175,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+				  &BTRFS_I(inode)->runtime_flags);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2195,26 +2206,33 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+	/*
+	 * evict_inode gets called without holding the i_mutex, so we rely on
+	 * atomic bit ops on runtime_flags to stay safe in messing with these.
+	 */
+	if (trans && test_and_clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+					&BTRFS_I(inode)->runtime_flags))
 		delete_item = 1;
-	}
 
-	if (BTRFS_I(inode)->orphan_meta_reserved) {
-		BTRFS_I(inode)->orphan_meta_reserved = 0;
+	if (trans && test_and_clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
+					&BTRFS_I(inode)->runtime_flags))
 		release_rsv = 1;
-	}
-	spin_unlock(&root->orphan_lock);
 
 	if (trans && delete_item) {
 		ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+		if (ret)
+			printk(KERN_ERR "couldn't find orphan item for %Lu, nlink %d, root %Lu, root being deleted %s\n",
+			       btrfs_ino(inode), inode->i_nlink, root->objectid,
+			       root->orphan_item_inserted ? "yes" : "no");
 		BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
 	}
 
 	if (release_rsv)
 		btrfs_orphan_release_metadata(inode);
 
+	if (trans && delete_item)
+		atomic_dec(&root->orphan_inodes);
+
 	return 0;
 }
 
@@ -2341,6 +2359,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 				ret = PTR_ERR(trans);
 				goto out;
 			}
+			printk(KERN_ERR "auto deleting %Lu\n",
+			       found_key.objectid);
 			ret = btrfs_del_orphan_item(trans, root,
 						    found_key.objectid);
 			BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */
@@ -2352,9 +2372,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * add this inode to the orphan list so btrfs_orphan_del does
 		 * the proper thing when we hit it
 		 */
-		spin_lock(&root->orphan_lock);
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
-		spin_unlock(&root->orphan_lock);
+		set_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+			&BTRFS_I(inode)->runtime_flags);
+		atomic_inc(&root->orphan_inodes);
 
 		/* if we have links, this was a truncate, lets do that */
 		if (inode->i_nlink) {
@@ -3607,7 +3627,8 @@ static int btrfs_setsize(struct inode *inode, loff_t newsize)
 		 * any new writes get down to disk quickly.
 		 */
 		if (newsize == 0)
-			BTRFS_I(inode)->ordered_data_close = 1;
+			set_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+				&BTRFS_I(inode)->runtime_flags);
 
 		/* we don't support swapfiles, so vmtruncate shouldn't fail */
 		truncate_setsize(inode, newsize);
@@ -3671,7 +3692,8 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+				 &BTRFS_I(inode)->runtime_flags));
 		goto no_delete;
 	}
 
@@ -4066,7 +4088,7 @@ static struct inode *new_simple_dir(struct super_block *s,
 
 	BTRFS_I(inode)->root = root;
 	memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
-	BTRFS_I(inode)->dummy_inode = 1;
+	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
 
 	inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
 	inode->i_op = &btrfs_dir_ro_inode_operations;
@@ -4370,7 +4392,7 @@ int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret = 0;
 	bool nolock = false;
 
-	if (BTRFS_I(inode)->dummy_inode)
+	if (test_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags))
 		return 0;
 
 	if (btrfs_fs_closing(root->fs_info) && btrfs_is_free_space_inode(root, inode))
@@ -4403,7 +4425,7 @@ int btrfs_dirty_inode(struct inode *inode)
 	struct btrfs_trans_handle *trans;
 	int ret;
 
-	if (BTRFS_I(inode)->dummy_inode)
+	if (test_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags))
 		return 0;
 
 	trans = btrfs_join_transaction(root);
@@ -6685,7 +6707,7 @@ static int btrfs_truncate(struct inode *inode)
 
 	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
 	if (ret)
-		return ret;
+		goto real_out;
 
 	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
 	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
@@ -6727,8 +6749,10 @@ static int btrfs_truncate(struct inode *inode)
 	 * updating the inode.
 	 */
 	rsv = btrfs_alloc_block_rsv(root);
-	if (!rsv)
-		return -ENOMEM;
+	if (!rsv) {
+		ret = -ENOMEM;
+		goto real_out;
+	}
 	rsv->size = min_size;
 
 	/*
@@ -6771,7 +6795,8 @@ static int btrfs_truncate(struct inode *inode)
 	 * using truncate to replace the contents of the file will
 	 * end up with a zero length file after a crash.
 	 */
-	if (inode->i_size == 0 && BTRFS_I(inode)->ordered_data_close)
+	if (inode->i_size == 0 && test_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+					   &BTRFS_I(inode)->runtime_flags))
 		btrfs_add_ordered_operation(trans, root, inode);
 
 	while (1) {
@@ -6847,7 +6872,7 @@ end_trans:
 
 out:
 	btrfs_free_block_rsv(root, rsv);
-
+real_out:
 	if (ret && !err)
 		err = ret;
 
@@ -6909,12 +6934,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->outstanding_extents = 0;
 	ei->reserved_extents = 0;
 
-	ei->ordered_data_close = 0;
-	ei->orphan_meta_reserved = 0;
-	ei->dummy_inode = 0;
-	ei->in_defrag = 0;
-	ei->delalloc_meta_reserved = 0;
-	ei->force_compress = BTRFS_COMPRESS_NONE;
+	ei->runtime_flags = 0;
 
 	ei->delayed_node = NULL;
 
@@ -6927,7 +6947,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +6991,12 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+		     &BTRFS_I(inode)->runtime_flags)) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 14f8e1f..a901654 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1039,6 +1039,21 @@ out:
 
 }
 
+static void btrfs_set_inode_force_compress(struct inode *inode,
+					   int compress_type)
+{
+	if (compress_type == BTRFS_COMPRESS_ZLIB) {
+		set_bit(BTRFS_INODE_FORCE_ZLIB, &BTRFS_I(inode)->runtime_flags);
+	} else if (compress_type == BTRFS_COMPRESS_LZO) {
+		set_bit(BTRFS_INODE_FORCE_LZO, &BTRFS_I(inode)->runtime_flags);
+	} else if (compress_type == BTRFS_COMPRESS_NONE) {
+		clear_bit(BTRFS_INODE_FORCE_ZLIB,
+			  &BTRFS_I(inode)->runtime_flags);
+		clear_bit(BTRFS_INODE_FORCE_LZO,
+			  &BTRFS_I(inode)->runtime_flags);
+	}
+}
+
 int btrfs_defrag_file(struct inode *inode, struct file *file,
 		      struct btrfs_ioctl_defrag_range_args *range,
 		      u64 newer_than, unsigned long max_to_defrag)
@@ -1162,7 +1177,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		if (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS)
-			BTRFS_I(inode)->force_compress = compress_type;
+			btrfs_set_inode_force_compress(inode, compress_type);
 
 		if (i + cluster > ra_index) {
 			ra_index = max(i, ra_index);
@@ -1230,7 +1245,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		atomic_dec(&root->fs_info->async_submit_draining);
 
 		mutex_lock(&inode->i_mutex);
-		BTRFS_I(inode)->force_compress = BTRFS_COMPRESS_NONE;
+		btrfs_set_inode_force_compress(inode, BTRFS_COMPRESS_NONE);
 		mutex_unlock(&inode->i_mutex);
 	}
 

^ permalink raw reply related	[flat|nested] 66+ messages in thread

 		 * any new writes get down to disk quickly.
 		 */
 		if (newsize == 0)
-			BTRFS_I(inode)->ordered_data_close = 1;
+			set_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+				&BTRFS_I(inode)->runtime_flags);
 
 		/* we don't support swapfiles, so vmtruncate shouldn't fail */
 		truncate_setsize(inode, newsize);
@@ -3671,7 +3692,8 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
 	if (root->fs_info->log_root_recovering) {
-		BUG_ON(!list_empty(&BTRFS_I(inode)->i_orphan));
+		BUG_ON(!test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+				 &BTRFS_I(inode)->runtime_flags));
 		goto no_delete;
 	}
 
@@ -4066,7 +4088,7 @@ static struct inode *new_simple_dir(struct super_block *s,
 
 	BTRFS_I(inode)->root = root;
 	memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
-	BTRFS_I(inode)->dummy_inode = 1;
+	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
 
 	inode->i_ino = BTRFS_EMPTY_SUBVOL_DIR_OBJECTID;
 	inode->i_op = &btrfs_dir_ro_inode_operations;
@@ -4370,7 +4392,7 @@ int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret = 0;
 	bool nolock = false;
 
-	if (BTRFS_I(inode)->dummy_inode)
+	if (test_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags))
 		return 0;
 
 	if (btrfs_fs_closing(root->fs_info) && btrfs_is_free_space_inode(root, inode))
@@ -4403,7 +4425,7 @@ int btrfs_dirty_inode(struct inode *inode)
 	struct btrfs_trans_handle *trans;
 	int ret;
 
-	if (BTRFS_I(inode)->dummy_inode)
+	if (test_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags))
 		return 0;
 
 	trans = btrfs_join_transaction(root);
@@ -6685,7 +6707,7 @@ static int btrfs_truncate(struct inode *inode)
 
 	ret = btrfs_truncate_page(inode->i_mapping, inode->i_size);
 	if (ret)
-		return ret;
+		goto real_out;
 
 	btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
 	btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
@@ -6727,8 +6749,10 @@ static int btrfs_truncate(struct inode *inode)
 	 * updating the inode.
 	 */
 	rsv = btrfs_alloc_block_rsv(root);
-	if (!rsv)
-		return -ENOMEM;
+	if (!rsv) {
+		ret = -ENOMEM;
+		goto real_out;
+	}
 	rsv->size = min_size;
 
 	/*
@@ -6771,7 +6795,8 @@ static int btrfs_truncate(struct inode *inode)
 	 * using truncate to replace the contents of the file will
 	 * end up with a zero length file after a crash.
 	 */
-	if (inode->i_size == 0 && BTRFS_I(inode)->ordered_data_close)
+	if (inode->i_size == 0 && test_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+					   &BTRFS_I(inode)->runtime_flags))
 		btrfs_add_ordered_operation(trans, root, inode);
 
 	while (1) {
@@ -6847,7 +6872,7 @@ end_trans:
 
 out:
 	btrfs_free_block_rsv(root, rsv);
-
+real_out:
 	if (ret && !err)
 		err = ret;
 
@@ -6909,12 +6934,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->outstanding_extents = 0;
 	ei->reserved_extents = 0;
 
-	ei->ordered_data_close = 0;
-	ei->orphan_meta_reserved = 0;
-	ei->dummy_inode = 0;
-	ei->in_defrag = 0;
-	ei->delalloc_meta_reserved = 0;
-	ei->force_compress = BTRFS_COMPRESS_NONE;
+	ei->runtime_flags = 0;
 
 	ei->delayed_node = NULL;
 
@@ -6927,7 +6947,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	mutex_init(&ei->log_mutex);
 	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
-	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
 	RB_CLEAR_NODE(&ei->rb_node);
@@ -6972,13 +6991,12 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_extent_lock);
 	}
 
-	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+	if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+		     &BTRFS_I(inode)->runtime_flags)) {
 		printk(KERN_INFO "BTRFS: inode %llu still on the orphan list\n",
 		       (unsigned long long)btrfs_ino(inode));
-		list_del_init(&BTRFS_I(inode)->i_orphan);
+		atomic_dec(&root->orphan_inodes);
 	}
-	spin_unlock(&root->orphan_lock);
 
 	while (1) {
 		ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 14f8e1f..a901654 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1039,6 +1039,21 @@ out:
 
 }
 
+static void btrfs_set_inode_force_compress(struct inode *inode,
+					   int compress_type)
+{
+	if (compress_type == BTRFS_COMPRESS_ZLIB) {
+		set_bit(BTRFS_INODE_FORCE_ZLIB, &BTRFS_I(inode)->runtime_flags);
+	} else if (compress_type == BTRFS_COMPRESS_LZO) {
+		set_bit(BTRFS_INODE_FORCE_LZO, &BTRFS_I(inode)->runtime_flags);
+	} else if (compress_type == BTRFS_COMPRESS_NONE) {
+		clear_bit(BTRFS_INODE_FORCE_ZLIB,
+			  &BTRFS_I(inode)->runtime_flags);
+		clear_bit(BTRFS_INODE_FORCE_LZO,
+			  &BTRFS_I(inode)->runtime_flags);
+	}
+}
+
 int btrfs_defrag_file(struct inode *inode, struct file *file,
 		      struct btrfs_ioctl_defrag_range_args *range,
 		      u64 newer_than, unsigned long max_to_defrag)
@@ -1162,7 +1177,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		}
 
 		if (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS)
-			BTRFS_I(inode)->force_compress = compress_type;
+			btrfs_set_inode_force_compress(inode, compress_type);
 
 		if (i + cluster > ra_index) {
 			ra_index = max(i, ra_index);
@@ -1230,7 +1245,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 		atomic_dec(&root->fs_info->async_submit_draining);
 
 		mutex_lock(&inode->i_mutex);
-		BTRFS_I(inode)->force_compress = BTRFS_COMPRESS_NONE;
+		btrfs_set_inode_force_compress(inode, BTRFS_COMPRESS_NONE);
 		mutex_unlock(&inode->i_mutex);
 	}
 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph on btrfs 3.4rc
  2012-05-23 15:02                                                 ` Josef Bacik
  (?)
@ 2012-05-23 19:12                                                 ` Martin Mailand
  2012-05-24  6:03                                                   ` Martin Mailand
  -1 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-23 19:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Christian Brunner, miaox, Sage Weil, linux-btrfs, ceph-devel

Hi Josef,

this patch has been running for 3 hours without a BUG and without the WARNING.
I will let it run overnight and report tomorrow.
It looks very good ;-)

-martin

Am 23.05.2012 17:02, schrieb Josef Bacik:
> Ok give this a shot, it should do it.  Thanks,


* Re: Ceph on btrfs 3.4rc
  2012-05-23 19:12                                                 ` Martin Mailand
@ 2012-05-24  6:03                                                   ` Martin Mailand
  2012-05-24  9:37                                                     ` Christian Brunner
  0 siblings, 1 reply; 66+ messages in thread
From: Martin Mailand @ 2012-05-24  6:03 UTC (permalink / raw)
  To: martin
  Cc: Josef Bacik, Christian Brunner, miaox, Sage Weil, linux-btrfs,
	ceph-devel

Hi,
the ceph cluster has been running under heavy load for the last 13 hours
without a problem; dmesg is empty and the performance is good.

-martin

Am 23.05.2012 21:12, schrieb Martin Mailand:
> this patch has been running for 3 hours without a BUG and without the WARNING.
> I will let it run overnight and report tomorrow.
> It looks very good ;-)


* Re: Ceph on btrfs 3.4rc
  2012-05-24  6:03                                                   ` Martin Mailand
@ 2012-05-24  9:37                                                     ` Christian Brunner
  0 siblings, 0 replies; 66+ messages in thread
From: Christian Brunner @ 2012-05-24  9:37 UTC (permalink / raw)
  To: linux-btrfs, ceph-devel

Same thing here.

I've tried really hard, but even after 12 hours I wasn't able to get a
single warning from btrfs.

I think you cracked it!

Thanks,
Christian

2012/5/24 Martin Mailand <martin@tuxadero.com>:
> Hi,
> the ceph cluster is running under heavy load for the last 13 hours without a
> problem, dmesg is empty and the performance is good.
>
> -martin
>
> Am 23.05.2012 21:12, schrieb Martin Mailand:
>
>> this patch has been running for 3 hours without a BUG and without the WARNING.
>> I will let it run overnight and report tomorrow.
>> It looks very good ;-)


end of thread, other threads:[~2012-05-24  9:37 UTC | newest]

Thread overview: 66+ messages
2012-04-20 15:09 Ceph on btrfs 3.4rc Christian Brunner
2012-04-23  7:20 ` Christian Brunner
2012-04-23  7:20   ` Christian Brunner
2012-04-24 15:21 ` Josef Bacik
2012-04-24 16:26   ` Sage Weil
2012-04-24 17:33     ` Josef Bacik
2012-04-24 17:41       ` Neil Horman
2012-04-25 11:28     ` Christian Brunner
2012-04-25 12:16       ` João Eduardo Luís
2012-04-27 11:02     ` Christian Brunner
2012-05-03 14:13       ` Josef Bacik
2012-05-03 14:13         ` Josef Bacik
2012-05-03 15:17         ` Josh Durgin
2012-05-03 15:17           ` Josh Durgin
2012-05-03 15:20           ` Josef Bacik
2012-05-03 15:20             ` Josef Bacik
2012-05-03 16:38             ` Josh Durgin
2012-05-03 16:38               ` Josh Durgin
2012-05-03 19:49               ` Josef Bacik
2012-05-03 19:49                 ` Josef Bacik
2012-05-04 20:24                 ` Christian Brunner
2012-05-04 20:24                   ` Christian Brunner
2012-05-09 20:25                   ` Josef Bacik
2012-05-09 20:25                     ` Josef Bacik
2012-05-10 17:40       ` Josef Bacik
2012-05-10 17:40         ` Josef Bacik
2012-05-10 20:35       ` Josef Bacik
2012-05-10 20:35         ` Josef Bacik
2012-05-11 13:31         ` Josef Bacik
2012-05-11 13:31           ` Josef Bacik
2012-05-11 18:33           ` Martin Mailand
2012-05-11 19:16             ` Josef Bacik
2012-05-14 14:19               ` Martin Mailand
2012-05-14 14:20                 ` Josef Bacik
2012-05-16 19:20                   ` Josef Bacik
2012-05-17 10:29                     ` Martin Mailand
2012-05-17 14:43                       ` Josef Bacik
2012-05-17 15:12                         ` Martin Mailand
2012-05-17 19:43                           ` Josef Bacik
2012-05-17 20:54                             ` Christian Brunner
2012-05-17 21:18                               ` Martin Mailand
2012-05-18 14:48                                 ` Josef Bacik
2012-05-18 17:24                                   ` Martin Mailand
2012-05-18 19:01                                     ` Josef Bacik
2012-05-18 20:11                                       ` Martin Mailand
2012-05-21  3:59                                       ` Miao Xie
2012-05-22 10:29                                         ` Christian Brunner
2012-05-22 10:29                                           ` Christian Brunner
2012-05-22 17:33                                           ` Josef Bacik
2012-05-22 17:33                                             ` Josef Bacik
2012-05-23 12:34                                             ` Christian Brunner
2012-05-23 12:34                                               ` Christian Brunner
2012-05-23 14:12                                               ` Josef Bacik
2012-05-23 14:12                                                 ` Josef Bacik
2012-05-23 15:02                                               ` Josef Bacik
2012-05-23 15:02                                                 ` Josef Bacik
2012-05-23 19:12                                                 ` Martin Mailand
2012-05-24  6:03                                                   ` Martin Mailand
2012-05-24  9:37                                                     ` Christian Brunner
2012-05-22 13:31                                         ` Josef Bacik
2012-05-11 13:46         ` Christian Brunner
2012-05-11 13:46           ` Christian Brunner
2012-04-29 21:09 ` tsuna
2012-04-30 10:28   ` Christian Brunner
2012-04-30 10:28     ` Christian Brunner
2012-04-30 10:54     ` Amon Ott
