* Bluestore: inaccurate disk usage statistics problem?
@ 2017-12-26 3:25 Zhi Zhang
2017-12-26 22:36 ` [ceph-users] " Sage Weil
2018-01-04 12:39 ` Igor Fedotov
0 siblings, 2 replies; 8+ messages in thread
From: Zhi Zhang @ 2017-12-26 3:25 UTC (permalink / raw)
To: ceph-devel, ceph-users
Hi,
We recently started to test bluestore with a huge number of small files
(only dozens of bytes per file). We have 22 OSDs in a test cluster
running ceph-12.2.1 with 2 replicas, and each OSD disk is 2 TB in size.
After we wrote about 150 million files through cephfs, we found that the
disk usage of each OSD reported by "ceph osd df" was more than 40%, which
meant more than 800 GB was used on each disk, but the actual total file
size was only about 5.2 GB, as reported by "ceph df" and also calculated
by ourselves.

The test is ongoing. I wonder whether the cluster will report the OSDs as
full after we write about 300 million files, even though the actual total
file size will still be far, far less than the disk usage. I will update
the result when the test is done.

My question is: are the disk usage statistics in bluestore inaccurate, or
does padding, alignment, or something else in bluestore waste the disk
space?
Thanks!
$ ceph osd df
ID CLASS  WEIGHT REWEIGHT  SIZE  USE AVAIL  %USE  VAR PGS
 0   hdd 1.49728  1.00000 1862G 853G 1009G 45.82 1.00 110
 1   hdd 1.69193  1.00000 1862G 807G 1054G 43.37 0.94 105
 2   hdd 1.81929  1.00000 1862G 811G 1051G 43.57 0.95 116
 3   hdd 2.00700  1.00000 1862G 839G 1023G 45.04 0.98 122
 4   hdd 2.06334  1.00000 1862G 886G  976G 47.58 1.03 130
 5   hdd 1.99051  1.00000 1862G 856G 1006G 45.95 1.00 118
 6   hdd 1.67519  1.00000 1862G 881G  981G 47.32 1.03 114
 7   hdd 1.81929  1.00000 1862G 874G  988G 46.94 1.02 120
 8   hdd 2.08881  1.00000 1862G 885G  976G 47.56 1.03 130
 9   hdd 1.64265  1.00000 1862G 852G 1010G 45.78 0.99 106
10   hdd 1.81929  1.00000 1862G 873G  989G 46.88 1.02 109
11   hdd 2.20041  1.00000 1862G 915G  947G 49.13 1.07 131
12   hdd 1.45694  1.00000 1862G 874G  988G 46.94 1.02 110
13   hdd 2.03847  1.00000 1862G 821G 1041G 44.08 0.96 113
14   hdd 1.53812  1.00000 1862G 810G 1052G 43.50 0.95 112
15   hdd 1.52914  1.00000 1862G 874G  988G 46.94 1.02 111
16   hdd 1.99176  1.00000 1862G 810G 1052G 43.51 0.95 114
17   hdd 1.81929  1.00000 1862G 841G 1021G 45.16 0.98 119
18   hdd 1.70901  1.00000 1862G 831G 1031G 44.61 0.97 113
19   hdd 1.67519  1.00000 1862G 875G  987G 47.02 1.02 115
20   hdd 2.03847  1.00000 1862G 864G  998G 46.39 1.01 115
21   hdd 2.18794  1.00000 1862G 920G  942G 49.39 1.07 127
                    TOTAL 40984G 18861G 22122G 46.02
$ ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    40984G     22122G     18861G           46.02
POOLS:
    NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
    cephfs_metadata      5      160M         0         6964G       77342
    cephfs_data          6     5193M      0.04         6964G   151292669
Regards,
Zhi Zhang (David)
Contact: zhang.david2011@gmail.com
zhangz.david@outlook.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
2017-12-26 3:25 Bluestore: inaccurate disk usage statistics problem? Zhi Zhang
@ 2017-12-26 22:36 ` Sage Weil
2017-12-27 2:59 ` Zhi Zhang
2018-01-04 12:39 ` Igor Fedotov
1 sibling, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-12-26 22:36 UTC (permalink / raw)
To: Zhi Zhang; +Cc: ceph-devel, ceph-users
On Tue, 26 Dec 2017, Zhi Zhang wrote:
> Hi,
>
> We recently started to test bluestore with a huge number of small files
> (only dozens of bytes per file). We have 22 OSDs in a test cluster
> running ceph-12.2.1 with 2 replicas, and each OSD disk is 2 TB in size.
> After we wrote about 150 million files through cephfs, we found that the
> disk usage of each OSD reported by "ceph osd df" was more than 40%, which
> meant more than 800 GB was used on each disk, but the actual total file
> size was only about 5.2 GB, as reported by "ceph df" and also calculated
> by ourselves.
>
> The test is ongoing. I wonder whether the cluster will report the OSDs as
> full after we write about 300 million files, even though the actual total
> file size will still be far, far less than the disk usage. I will update
> the result when the test is done.
>
> My question is: are the disk usage statistics in bluestore inaccurate, or
> does padding, alignment, or something else in bluestore waste the disk
> space?
Bluestore isn't making any attempt to optimize for small files, so a
one-byte file will consume min_alloc_size (64 KB on HDD, 16 KB on SSD,
IIRC).
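
As a rough back-of-the-envelope check, assuming the 64 KB HDD default and
2x replication:

    150,000,000 objects x 65,536 bytes x 2 replicas ~= 19.7 TB raw
    19.7 TB / 22 OSDs                               ~= 0.9 TB (~830 GiB) per OSD

which is in the same ballpark as the ~850G per OSD that "ceph osd df"
reports above.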
It probably wouldn't be too difficult to add an "inline" data feature for
small objects that puts them in rocksdb...
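
Something like the following standalone sketch against plain RocksDB shows
the storage model; the key scheme is made up here and is not BlueStore's
actual onode/data layout:

#include <cassert>
#include <string>

#include <rocksdb/db.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/inline_demo", &db);
  assert(s.ok());

  // A dozens-of-bytes "file" stored directly as a key/value pair: it costs
  // roughly key + value bytes in the DB rather than a 64 KB allocation unit.
  const std::string key = "inline/10000000abc.00000000";  // hypothetical key scheme
  const std::string val(42, 'x');                         // the tiny payload
  s = db->Put(rocksdb::WriteOptions(), key, val);
  assert(s.ok());

  std::string out;
  s = db->Get(rocksdb::ReadOptions(), key, &out);
  assert(s.ok() && out.size() == 42);

  delete db;
  return 0;
}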
sage
> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
2017-12-26 22:36 ` [ceph-users] " Sage Weil
@ 2017-12-27 2:59 ` Zhi Zhang
0 siblings, 0 replies; 8+ messages in thread
From: Zhi Zhang @ 2017-12-27 2:59 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, ceph-users
Hi Sage,
Thanks for the quick reply. I read the code, and our test also confirmed
that disk space was being wasted due to min_alloc_size.

We are very much looking forward to the "inline" data feature for small
objects. We will also look into this feature and hopefully work with the
community on it.
Regards,
Zhi Zhang (David)
Contact: zhang.david2011@gmail.com
zhangz.david@outlook.com
On Wed, Dec 27, 2017 at 6:36 AM, Sage Weil <sage@newdream.net> wrote:
> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
2017-12-26 3:25 Bluestore: inaccurate disk usage statistics problem? Zhi Zhang
2017-12-26 22:36 ` [ceph-users] " Sage Weil
@ 2018-01-04 12:39 ` Igor Fedotov
[not found] ` <1c3dd39a-5e0c-9fcf-5502-8a94c899b6cb-l3A5Bk7waGM@public.gmane.org>
1 sibling, 1 reply; 8+ messages in thread
From: Igor Fedotov @ 2018-01-04 12:39 UTC (permalink / raw)
To: Zhi Zhang, ceph-devel, ceph-users
An additional issue with the disk usage statistics that I've just realized
is that BlueStore's statfs call reports total disk space as

    block device total space + DB device total space

while available space is measured as

    block device free space + bluefs free space at the block device -
    bluestore_bluefs_free param

This results in a higher used-space value (since available space on the DB
device isn't taken into account) and odd results when the cluster is
(almost) empty.
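
To make the mismatch concrete, a toy calculation with assumed sizes (a
1862G main device plus a 30G DB device, on an OSD that stores nothing yet):

    total = 1862G (block) + 30G (DB)  = 1892G
    avail = 1862G (block free only; DB free space is never added back)
    used  = total - avail            ~= 30G "used" on an empty OSD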
IMO we shouldn't use the DB device in the total space calculation.
Sage, what do you think?
Thanks,
Igor
On 12/26/2017 6:25 AM, Zhi Zhang wrote:
> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Bluestore: inaccurate disk usage statistics problem?
[not found] ` <1c3dd39a-5e0c-9fcf-5502-8a94c899b6cb-l3A5Bk7waGM@public.gmane.org>
@ 2018-01-04 14:27 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1801041424450.24931-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2018-01-04 14:27 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel, ceph-users
On Thu, 4 Jan 2018, Igor Fedotov wrote:
> An additional issue with the disk usage statistics that I've just realized
> is that BlueStore's statfs call reports total disk space as
>
>     block device total space + DB device total space
>
> while available space is measured as
>
>     block device free space + bluefs free space at the block device -
>     bluestore_bluefs_free param
>
> This results in a higher used-space value (since available space on the DB
> device isn't taken into account) and odd results when the cluster is
> (almost) empty.
Isn't "bluefs free space at block device" the same as the db device free?
(Actually, bluefs may include part of main device too, but that would also
be reported as part of bluefs free space.)
sage
> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Bluestore: inaccurate disk usage statistics problem?
[not found] ` <alpine.DEB.2.11.1801041424450.24931-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2018-01-04 14:42 ` Igor Fedotov
2018-01-04 14:52 ` [ceph-users] " Sage Weil
0 siblings, 1 reply; 8+ messages in thread
From: Igor Fedotov @ 2018-01-04 14:42 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, ceph-users
On 1/4/2018 5:27 PM, Sage Weil wrote:
> On Thu, 4 Jan 2018, Igor Fedotov wrote:
>> An additional issue with the disk usage statistics that I've just realized
>> is that BlueStore's statfs call reports total disk space as
>>
>>     block device total space + DB device total space
>>
>> while available space is measured as
>>
>>     block device free space + bluefs free space at the block device -
>>     bluestore_bluefs_free param
>>
>> This results in a higher used-space value (since available space on the DB
>> device isn't taken into account) and odd results when the cluster is
>> (almost) empty.
> Isn't "bluefs free space at block device" the same as the db device free?
I suppose not. It looks like BlueFS reports free space on a per-device
basis:

uint64_t BlueFS::get_free(unsigned id)
{
  std::lock_guard<std::mutex> l(lock);
  assert(id < alloc.size());
  return alloc[id]->get_free();
}

hence bluefs->get_free(bluefs_shared_bdev) from statfs returns the bluefs
free space at the block device only.
> (Actually, bluefs may include part of the main device too, but that would
> also be reported as part of bluefs free space.)
>
> sage
>
>> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
2018-01-04 14:42 ` Igor Fedotov
@ 2018-01-04 14:52 ` Sage Weil
2018-01-04 14:58 ` Igor Fedotov
0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2018-01-04 14:52 UTC (permalink / raw)
To: Igor Fedotov; +Cc: Zhi Zhang, ceph-devel, ceph-users
On Thu, 4 Jan 2018, Igor Fedotov wrote:
> On 1/4/2018 5:27 PM, Sage Weil wrote:
> > On Thu, 4 Jan 2018, Igor Fedotov wrote:
> > > An additional issue with the disk usage statistics that I've just
> > > realized is that BlueStore's statfs call reports total disk space as
> > >
> > >     block device total space + DB device total space
> > >
> > > while available space is measured as
> > >
> > >     block device free space + bluefs free space at the block device -
> > >     bluestore_bluefs_free param
> > >
> > > This results in a higher used-space value (since available space on the
> > > DB device isn't taken into account) and odd results when the cluster is
> > > (almost) empty.
> > Isn't "bluefs free space at block device" the same as the db device free?
> I suppose not. It looks like BlueFS reports free space on a per-device
> basis:
>
> uint64_t BlueFS::get_free(unsigned id)
> {
>   std::lock_guard<std::mutex> l(lock);
>   assert(id < alloc.size());
>   return alloc[id]->get_free();
> }
>
> hence bluefs->get_free(bluefs_shared_bdev) from statfs returns the bluefs
> free space at the block device only.
I see. So we can either add in the DB device so that total and free agree
in scope, although some of that space is special (it can't store objects),
or we can report only the primary device, in which case some of the omap
capacity is "hidden."

I lean toward the latter, since we also can't account for omap usage
currently. (I think we can improve that, though, by prefixing all of the
omap keys with the pool id and making use of the rocksdb usage estimation
methods.)
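
Roughly what the estimation part could look like against plain RocksDB, as
a standalone sketch (the "p6." key prefix is just an assumed scheme, and
the exact GetApproximateSizes signature has shifted between RocksDB
releases):

#include <cstdint>
#include <cstdio>

#include <rocksdb/db.h>

// Rough estimate of how much of the DB is consumed by omap keys belonging
// to pool 6, assuming those keys were written with a hypothetical "p6." prefix.
uint64_t estimate_pool_omap_bytes(rocksdb::DB* db) {
  rocksdb::Range r("p6.", "p6/");  // '/' sorts right after '.', so this range
                                   // covers every key starting with "p6."
  uint64_t size = 0;
  db->GetApproximateSizes(&r, 1, &size);
  return size;
}

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  if (rocksdb::DB::Open(opts, "/tmp/omap_demo", &db).ok()) {
    std::printf("pool 6 omap usage ~ %llu bytes\n",
                (unsigned long long)estimate_pool_omap_bytes(db));
    delete db;
  }
  return 0;
}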
sage
> > [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [ceph-users] Bluestore: inaccurate disk usage statistics problem?
2018-01-04 14:52 ` [ceph-users] " Sage Weil
@ 2018-01-04 14:58 ` Igor Fedotov
0 siblings, 0 replies; 8+ messages in thread
From: Igor Fedotov @ 2018-01-04 14:58 UTC (permalink / raw)
To: Sage Weil; +Cc: Zhi Zhang, ceph-devel, ceph-users
On 1/4/2018 5:52 PM, Sage Weil wrote:
> On Thu, 4 Jan 2018, Igor Fedotov wrote:
>> On 1/4/2018 5:27 PM, Sage Weil wrote:
>>> On Thu, 4 Jan 2018, Igor Fedotov wrote:
>>>> An additional issue with the disk usage statistics that I've just
>>>> realized is that BlueStore's statfs call reports total disk space as
>>>>
>>>>     block device total space + DB device total space
>>>>
>>>> while available space is measured as
>>>>
>>>>     block device free space + bluefs free space at the block device -
>>>>     bluestore_bluefs_free param
>>>>
>>>> This results in a higher used-space value (since available space on the
>>>> DB device isn't taken into account) and odd results when the cluster is
>>>> (almost) empty.
>>> Isn't "bluefs free space at block device" the same as the db device free?
>> I suppose not. It looks like BlueFS reports free space on a per-device
>> basis:
>>
>> uint64_t BlueFS::get_free(unsigned id)
>> {
>>   std::lock_guard<std::mutex> l(lock);
>>   assert(id < alloc.size());
>>   return alloc[id]->get_free();
>> }
>>
>> hence bluefs->get_free(bluefs_shared_bdev) from statfs returns the bluefs
>> free space at the block device only.
> I see. So we can either add in the DB device so that total and free agree
> in scope, although some of that space is special (it can't store objects),
> or we can report only the primary device, in which case some of the omap
> capacity is "hidden."
>
> I lean toward the latter, since we also can't account for omap usage
> currently. (I think we can improve that, though, by prefixing all of the
> omap keys with the pool id and making use of the rocksdb usage estimation
> methods.)
+1 for the latter
> sage
>
>>> [snip]
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-01-04 14:58 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-26 3:25 Bluestore: inaccurate disk usage statistics problem? Zhi Zhang
2017-12-26 22:36 ` [ceph-users] " Sage Weil
2017-12-27 2:59 ` Zhi Zhang
2018-01-04 12:39 ` Igor Fedotov
[not found] ` <1c3dd39a-5e0c-9fcf-5502-8a94c899b6cb-l3A5Bk7waGM@public.gmane.org>
2018-01-04 14:27 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1801041424450.24931-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
2018-01-04 14:42 ` Igor Fedotov
2018-01-04 14:52 ` [ceph-users] " Sage Weil
2018-01-04 14:58 ` Igor Fedotov