Re: Tux3 Report: Meet Shardmap, the designated successor of HTree

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
@ 2013-06-20 17:02 Daniel Phillips
  2013-06-20 19:11 ` Christian Stroetmann
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Phillips @ 2013-06-20 17:02 UTC (permalink / raw)
  To: Christian Stroetmann, linux-kernel, linux-fsdevel, tux3

Hi Christian,

You are welcome, and I hope that your project can make good use of this 
technology. Please do credit your sources if you use these ideas, and 
please keep the CC list intact in further replies.

What is the scale of your application, that is, how many index entries 
do you expect?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-20 17:02 Tux3 Report: Meet Shardmap, the designated successor of HTree Daniel Phillips
@ 2013-06-20 19:11 ` Christian Stroetmann
  2013-06-20 20:27   ` Daniel Phillips
  2013-06-24 14:18   ` Pavel Machek
  0 siblings, 2 replies; 13+ messages in thread
From: Christian Stroetmann @ 2013-06-20 19:11 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linux FS Devel, Linux Kernel, Tux3

Hello Mr. Daniel Philips,

I'm sorry to say so, but your are really a funny person.

At first you came up with a file system that can handle a great 
many/billions files and has ACID feature, which are both features of my 
Ontologic File System (OntoFS; see [1]). Both were said to be a no-go at 
that time (around 2007 and 2008).
Then you came up, with my concept of a log-structured hashing based file 
system [2] and [3], presented it as your invention yesterday [4], and 
even integrated it with your Tux3 file system that already has or should 
have the said features of my OntoFS. I only waited for this step by 
somebody strongly connected with the company Samsung since the October 
2012. AIso, I do think that both steps are very clear signs that shows 
what is going on behind the curtain.
And now your are so bold and please me that I should credit these ideas 
in the sense of crediting your ideas. For sure, I always do claim for 
copyright of my ideas, and the true question is if you are allowed to 
implement them at all. In this conjunction, I would give the other 
mailing list members the information as well, that I do not need 
something technical from you at all, that has to be credited. I only 
meant it as part of a broad hint and for lowering the noise on the 
mailing list.
Besides this, your permanent marketing by using speech acts from my 
websites is annoying, as it is the case with playing here the unknown 
now. Furthermore, you already were given the count with your last 
screwed test and you have nothing better to do than to come up with my 
log-structured hashing based file system and again a marketing story. I 
really have to ask the question: Who do you want to kid? Who do you want 
to provoke? Who do you want to mislead?

Also, I truely thought that the broad hints given some weeks ago and 
yesterday again would be clear enough, and I still think so respectively 
that you really got the issue. But if as a matter of fact this might be 
not the case, I simply say it directly without any decorating flowers:
1. Stop copying my intellectual properties related with file systems and 
implementing them. You always came several months too late and I am not 
interested to let it become a running gag, definitely.
2. Stop marketing my ideas, especially in a way that confuses the public 
about the true origin even further. I am already marketing them on my own.
3. Give credits to my intellectual properties in any case, even if you 
make a derivation, and take care about the correct licensing.

[1] OntoFS (www.ontolinux.com/technology/ontofs.htm)
[2] SASOS4Fun (www.ontonics.com/innovation/pipeline.htm#sasos4fun) Do 
not confuse SIP with SipHash, but put SipHash in relation with "the size 
of [a] hash table [is determined] by sampling the input", as we 
understood the description.
[3] Ontonics, OntoLab, and OntoLinux Further steps 
(www.ontomax.com/newsarchive/2012/october.htm#08.October.2012)
[4] Meet Shardmap, the designated successor of HTree 
(lkml.org/lkml/2013/6/18/869)

Btw. 1: Firstly, Daniel Philips had no CC list at all with his initial 
e-mail. Secondly, the issue with the CC list on my side was a mistake on 
the one hand (forgot to push the CC button) and a part of the last broad 
hint on the other hand.

Btw. 2: Are you Google and now Samsung or both?

Sincerely
Christian Stroetmann

> Hi Christian,
>
> You are welcome, and I hope that your project can make good use of this
> technology. Please do credit your sources if you use these ideas, and
> please keep the CC list intact in further replies.
>
> What is the scale of your application, that is, how many index entries
> do you expect?
>
> Regards,
>
> Daniel
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-20 19:11 ` Christian Stroetmann
@ 2013-06-20 20:27   ` Daniel Phillips
  2015-04-15 18:00     ` Christian Stroetmann
  2013-06-24 14:18   ` Pavel Machek
  1 sibling, 1 reply; 13+ messages in thread
From: Daniel Phillips @ 2013-06-20 20:27 UTC (permalink / raw)
  To: Christian Stroetmann; +Cc: Linux FS Devel, Linux Kernel, Tux3

On 06/20/2013 12:12 PM, Christian Stroetmann wrote:
> 1. Stop copying my intellectual properties related with file systems and
> implementing them. You always came several months too late and I am not
> interested to let it become a running gag, definitely.
> 2. Stop marketing my ideas, especially in a way that confuses the public
> about the true origin even further. I am already marketing them on my own.
> 3. Give credits to my intellectual properties in any case, even if you
> make a derivation, and take care about the correct licensing.

Could you please direct us to details of your design so that we may 
properly appreciate it?

Note that the key idea in Shardmap is not simply logging a hash table, 
but sharding it and logging it as a forest of fifos.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-20 19:11 ` Christian Stroetmann
  2013-06-20 20:27   ` Daniel Phillips
@ 2013-06-24 14:18   ` Pavel Machek
  2013-06-24 15:16     ` Christian Stroetmann
  1 sibling, 1 reply; 13+ messages in thread
From: Pavel Machek @ 2013-06-24 14:18 UTC (permalink / raw)
  To: Christian Stroetmann; +Cc: Daniel Phillips, Linux FS Devel, Linux Kernel, Tux3

Hi!

> At first you came up with a file system that can handle a great
> many/billions files and has ACID feature, which are both features of
> my Ontologic File System (OntoFS; see [1]). Both were said to be a
> no-go at that time (around 2007 and 2008).
> Then you came up, with my concept of a log-structured hashing based
> file system [2] and [3], presented it as your invention yesterday
> [4], and even integrated it with your Tux3 file system that already
> has or should have the said features of my OntoFS. I only waited for
> this step by somebody strongly connected with the company Samsung
> since the October 2012. AIso, I do think that both steps are very
> clear signs that shows what is going on behind the curtain.
> And now your are so bold and please me that I should credit these
> ideas in the sense of crediting your ideas. For sure, I always do
> claim for copyright of my ideas, and the true question is if you are
> allowed to implement them at all. In this conjunction, I would give

Fortunately, you can't copyright ideas. Chuck Norris managed to do it
once, but you can't.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-24 14:18   ` Pavel Machek
@ 2013-06-24 15:16     ` Christian Stroetmann
  2013-06-24 15:39       ` Andreas Karlsson
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Stroetmann @ 2013-06-24 15:16 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linux Kernel, Linux FS Devel, Tux3, Daniel Phillips

Dear Mr. Pavel Machek:

Is this a serious comment?
Nevertheless, this is a copyrighted idea [1].


Sincerely
Christian Stroetmann

[1] Log-Structured Hash-based File System (LogHashFS or LHFS; 
www.ontonics.com/innovation/pipeline.htm#loghashfs)

> Hi!
>
>> At first you came up with a file system that can handle a great
>> many/billions files and has ACID feature, which are both features of
>> my Ontologic File System (OntoFS; see [1]). Both were said to be a
>> no-go at that time (around 2007 and 2008).
>> Then you came up, with my concept of a log-structured hashing based
>> file system [2] and [3], presented it as your invention yesterday
>> [4], and even integrated it with your Tux3 file system that already
>> has or should have the said features of my OntoFS. I only waited for
>> this step by somebody strongly connected with the company Samsung
>> since the October 2012. AIso, I do think that both steps are very
>> clear signs that shows what is going on behind the curtain.
>> And now your are so bold and please me that I should credit these
>> ideas in the sense of crediting your ideas. For sure, I always do
>> claim for copyright of my ideas, and the true question is if you are
>> allowed to implement them at all. In this conjunction, I would give
> Fortunately, you can't copyright ideas. Chuck Norris managed to do it
> once, but you can't.
>
> 									Pavel



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-24 15:16     ` Christian Stroetmann
@ 2013-06-24 15:39       ` Andreas Karlsson
  2013-06-24 17:19         ` Christian Stroetmann
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Karlsson @ 2013-06-24 15:39 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: Pavel Machek, Linux FS Devel, Tux3, Daniel Phillips, Linux Kernel

Hi,

I assume it is serious since ideas cannot be copyrighted in most (or 
maybe even all) countries.

 From the FAQ of the U.S. Copyright Office [1]:

"Copyright does not protect facts, ideas, systems, or methods of 
operation, although it may protect the way these things are expressed."

"How do I protect my idea?

Copyright does not protect ideas, concepts, systems, or methods of doing 
something. You may express your ideas in writing or drawings and claim 
copyright in your description, but be aware that copyright will not 
protect the idea itself as revealed in your written or artistic work."

Andreas

[1] http://www.copyright.gov/help/faq/faq-protect.html#what_protect

On 06/24/2013 05:16 PM, Christian Stroetmann wrote:
> Dear Mr. Pavel Machek:
>
> Is this a serious comment?
> Nevertheless, this is a copyrighted idea [1].
>
>
> Sincerely
> Christian Stroetmann
>
> [1] Log-Structured Hash-based File System (LogHashFS or LHFS;
> www.ontonics.com/innovation/pipeline.htm#loghashfs)
>
>> Hi!
>>
>>> At first you came up with a file system that can handle a great
>>> many/billions files and has ACID feature, which are both features of
>>> my Ontologic File System (OntoFS; see [1]). Both were said to be a
>>> no-go at that time (around 2007 and 2008).
>>> Then you came up, with my concept of a log-structured hashing based
>>> file system [2] and [3], presented it as your invention yesterday
>>> [4], and even integrated it with your Tux3 file system that already
>>> has or should have the said features of my OntoFS. I only waited for
>>> this step by somebody strongly connected with the company Samsung
>>> since the October 2012. AIso, I do think that both steps are very
>>> clear signs that shows what is going on behind the curtain.
>>> And now your are so bold and please me that I should credit these
>>> ideas in the sense of crediting your ideas. For sure, I always do
>>> claim for copyright of my ideas, and the true question is if you are
>>> allowed to implement them at all. In this conjunction, I would give
>> Fortunately, you can't copyright ideas. Chuck Norris managed to do it
>> once, but you can't.
>>
>> Pavel
>
>
>
> _______________________________________________
> Tux3 mailing list
> Tux3@phunq.net
> http://phunq.net/mailman/listinfo/tux3


-- 
Andreas Karlsson

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-24 15:39       ` Andreas Karlsson
@ 2013-06-24 17:19         ` Christian Stroetmann
  2013-06-24 18:20           ` richard -rw- weinberger
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Stroetmann @ 2013-06-24 17:19 UTC (permalink / raw)
  To: Andreas Karlsson
  Cc: Linux Kernel, Linux FS Devel, Daniel Phillips, Pavel Machek

Dear Mr. Andreas Karlsson:

Thank you for the quote. If we really want to bring this down to the 
lowest level, then an idea is not copyrighted per se, for sure.

But, what is the problem? That I have used the wrong word? The context 
was clear and hence the meaning of the word "ideas". So, simply add the 
word "described", "presented", or "publicated" before the word "ideas" 
or substitute the word "ideas" with the term "technical descriptions 
given in textual form".

Or is the problem that the idea has been publicated and everybody is now 
free to take it? The answer is: yes and no.
Yes:
Everybody can implement an idea, concept, system, or method of operation 
even if it is given with a copyrighted representation, and give it away 
or even sell it, because in the latter case it is not patented. But how 
should somebody give something away or sell something without giving a 
description about what it is? This leads to the other case.

No:
What you and many other persons might misunderstand is the fact that if 
an idea is copyrighted by being represented in a textual, visual, or 
other form, it even does not matter at all if somebody else takes 
another text or image as long as the sense/the idea described in this 
other way is still the same as described in the original text or image, 
because it does not matter in which way the copyrighted thing is 
communicated. As a good example take a melody publicated with written 
notes. It does not matter on which instrument you play it or in which 
style you sing it. It is still the same melody. Or take a written script 
of a movie that is narrated from the view of the main protagonist in the 
original plot and from the view of another actor or a voice in the back 
in a copied plot. In all cases it is still the same copyrighted idea/story.

Said this, every other description or documentation that uses different 
terms, a source code, even a visual graphic, or whatever representation 
of a file system that has the characteristical features of my 
log-structured hashing based data storage system with one or more of the 
optional features like for example consistent hashing, different data 
structures on the physical storage system and in the (virtual) memory, 
finger table, logged finger or hash tables in memory, and ACID 
properties (see again the given links), as it is the case with the 
latest description of the Tux3 FS with Shardmap, transports/communicates 
the copyrighted description of the same idea/concept/system in large 
parts or even as a whole somehow. And simply coding and compiling it 
does not help as well due to the many possibilities of re-engineering.

Regards
Christian Stroetmann

> Hi,
>
> I assume it is serious since ideas cannot be copyrighted in most (or 
> maybe even all) countries.
>
> From the FAQ of the U.S. Copyright Office [1]:
>
> "Copyright does not protect facts, ideas, systems, or methods of 
> operation, although it may protect the way these things are expressed."
>
> "How do I protect my idea?
>
> Copyright does not protect ideas, concepts, systems, or methods of 
> doing something. You may express your ideas in writing or drawings and 
> claim copyright in your description, but be aware that copyright will 
> not protect the idea itself as revealed in your written or artistic 
> work."
>
> Andreas
>
> [1] http://www.copyright.gov/help/faq/faq-protect.html#what_protect
>
> On 06/24/2013 05:16 PM, Christian Stroetmann wrote:
>> Dear Mr. Pavel Machek:
>>
>> Is this a serious comment?
>> Nevertheless, this is a copyrighted idea [1].
>>
>>
>> Sincerely
>> Christian Stroetmann
>>
>> [1] Log-Structured Hash-based File System (LogHashFS or LHFS;
>> www.ontonics.com/innovation/pipeline.htm#loghashfs)
>>
>>> Hi!
>>>
>>>> At first you came up with a file system that can handle a great
>>>> many/billions files and has ACID feature, which are both features of
>>>> my Ontologic File System (OntoFS; see [1]). Both were said to be a
>>>> no-go at that time (around 2007 and 2008).
>>>> Then you came up, with my concept of a log-structured hashing based
>>>> file system [2] and [3], presented it as your invention yesterday
>>>> [4], and even integrated it with your Tux3 file system that already
>>>> has or should have the said features of my OntoFS. I only waited for
>>>> this step by somebody strongly connected with the company Samsung
>>>> since the October 2012. AIso, I do think that both steps are very
>>>> clear signs that shows what is going on behind the curtain.
>>>> And now your are so bold and please me that I should credit these
>>>> ideas in the sense of crediting your ideas. For sure, I always do
>>>> claim for copyright of my ideas, and the true question is if you are
>>>> allowed to implement them at all. In this conjunction, I would give
>>> Fortunately, you can't copyright ideas. Chuck Norris managed to do it
>>> once, but you can't.
>>>
>>> Pavel
>>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-24 17:19         ` Christian Stroetmann
@ 2013-06-24 18:20           ` richard -rw- weinberger
  2013-06-25  0:34             ` Christian Stroetmann
  0 siblings, 1 reply; 13+ messages in thread
From: richard -rw- weinberger @ 2013-06-24 18:20 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: Andreas Karlsson, Linux Kernel, Linux FS Devel, Daniel Phillips,
	Pavel Machek

Let's do the same as in 2009[1] and finish this thread.

[1] http://www.spinics.net/lists/reiserfs-devel/msg01543.html

--
Thanks,
//richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-25  0:34             ` Christian Stroetmann
@ 2013-06-25  0:15               ` David Lang
  0 siblings, 0 replies; 13+ messages in thread
From: David Lang @ 2013-06-25  0:15 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: richard -rw- weinberger, Linux Kernel, Linux FS Devel, Tux3,
	Andreas Karlsson, Daniel Phillips, Pavel Machek

On Tue, 25 Jun 2013, Christian Stroetmann wrote:

> Dear Mr. Richard Weinberger:
>
> Thank you very much for the reminder and the prove again that a profound 
> discussion seems not to be possible. Even more important is the point that 
> the discussion related with the ReiserFS was different than this discussion, 
> because this time I have not presented the LogHashFS to the open source 
> community, but another person has taken copyright descriptions from my 
> websites and wanted to make it an open source project and this even by the 
> support of another company, which by the way has its very own business 
> strategy.

unless they copied your descriptions pretty close to word for word it's not a 
copyright violation.

It's perfectly legal to take the ideas from one document and write a new 
document that explains those ideas. The copyright is on the exact text, not on 
the ideas.

Patents give you the right to the idea, Copyright only gives you the right to 
the particular expression of the idea.

If you published a paper explaining the concept of a LogHashFS that contained no 
code, then anyone who actually wrote a filesystem implementing the ideas in your 
paper could not possibly be violating your copyright (unless they included too 
much of your paper in comments), because they wrote code, not a paper.

David Lang

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-24 18:20           ` richard -rw- weinberger
@ 2013-06-25  0:34             ` Christian Stroetmann
  2013-06-25  0:15               ` David Lang
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Stroetmann @ 2013-06-25  0:34 UTC (permalink / raw)
  To: richard -rw- weinberger
  Cc: Linux Kernel, Linux FS Devel, Tux3, Andreas Karlsson,
	Daniel Phillips, Pavel Machek

Dear Mr. Richard Weinberger:

Thank you very much for the reminder and the prove again that a profound 
discussion seems not to be possible. Even more important is the point 
that the discussion related with the ReiserFS was different than this 
discussion, because this time I have not presented the LogHashFS to the 
open source community, but another person has taken copyright 
descriptions from my websites and wanted to make it an open source 
project and this even by the support of another company, which by the 
way has its very own business strategy.

Now, other and I have heard what we wanted: In the moment you have no 
more arguments you become offensive and begin to mob and to intirigue. 
Besides this, ReiserFS is virtually dead .... Furthermore, that 
journalist from the Linux Magazin said it due to other political and 
economical reasons in the B.R.D. as well and most potentially did never 
something that is important for the open source community. Sooner or 
later he will get a letter from my attorney  for this offensive with the 
demand to beg for pardon publicly in the ReiserFS and Tux3 mailing lists.

Said this, I will not sent any e-mails to this Chuck Nonsense thread 
anymore. It was a mistake at all to try it again.

Sincerely
Christian Stroetmann

> Let's do the same as in 2009[1] and finish this thread.
>
> [1] http://www.spinics.net/lists/reiserfs-devel/msg01543.html
>
> --
> Thanks,
> //richard
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-20 20:27   ` Daniel Phillips
@ 2015-04-15 18:00     ` Christian Stroetmann
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Stroetmann @ 2015-04-15 18:00 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Christian Stroetmann, Linux FS Devel, Linux Kernel, Tux3

On the 20th of June 2013 22:27, Daniel Phillips wrote:
> On 06/20/2013 12:12 PM, Christian Stroetmann wrote:
>> 1. Stop copying my intellectual properties related with file systems and
>> implementing them. You always came several months too late and I am not
>> interested to let it become a running gag, definitely.
>> 2. Stop marketing my ideas, especially in a way that confuses the public
>> about the true origin even further. I am already marketing them on my own.
>> 3. Give credits to my intellectual properties in any case, even if you
>> make a derivation, and take care about the correct licensing.
> Could you please direct us to details of your design so that we may
> properly appreciate it?
>
> Note that the key idea in Shardmap is not simply logging a hash table,
> but sharding it and logging it as a forest of fifos.
>
> Regards,
>
> Daniel
>

Around 2 years ago, I looked at some details of the FS design and 
discussed the copyright issue with one of my attorneys.
Today, I would like to make the following (maybe closing) words before 
somebody says I would block a development:
1. In general, there is a copyright for every protectable work done by a 
person in the moment of its publication, but in practice it is hard to 
prove, specifically in such a technical case. I will not go into the 
legal details.
Said this, at least I reject my claims, but still think that generally 
it would by a constructive measure if even ideas are referenced in 
relation with open source hard- and software.
In my case it led to the situation that I have stopped to publicate 
ideas, with some very few exceptions.
2. In particular respectively from the point of view of the software 
design, the implementation is virtually what I have proposed (as well). 
Indeed, there are some interesting details and elegant paraphrases.
3. I think it would be interesting to analyze how well this FS works 
respectively to compare this FS with databases that implement something 
surprisingly similar.

C.S.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Tux3 Report: Meet Shardmap, the designated successor of HTree
  2013-06-19  2:31 Daniel Phillips
@ 2013-06-19 18:17 ` Christian Stroetmann
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Stroetmann @ 2013-06-19 18:17 UTC (permalink / raw)
  To: Linux FS Devel, Linux Kernel

Aloha everybody

We would like to thank the developers very much for giving technical 
details about how we could implement our file system indexing (see [1] 
and [2]).

[1] SASOS4Fun (www.ontonics.com/innovation/pipeline.htm#sasos4fun) Do 
not confuse SIP with SipHash, but put SipHash in relation with "the size 
of [a] hash table [is determined] by sampling the input", as we 
understood the description.
[2] Ontonics, OntoLab, and OntoLinux Further steps 
(www.ontomax.com/newsarchive/2012/october.htm#08.October.2012)

Sincerely
Christian Stroetmann

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Tux3 Report: Meet Shardmap, the designated successor of HTree
@ 2013-06-19  2:31 Daniel Phillips
  2013-06-19 18:17 ` Christian Stroetmann
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Phillips @ 2013-06-19  2:31 UTC (permalink / raw)
  To: linux-kernel, tux3, linux-fsdevel

Greetings all,

 From time to time, one may fortunate enough to be blessed with a 
discovery in computer science that succeeds at improving all four of 
performance, scalability, reliability and simplicity. Of these normally 
conflicting goals, simplicity is usually the most elusive. It is 
therefore with considerable satisfaction that I present the results of 
our recent development work in directory indexing technology, which 
addresses some long-standing and vexing scalability problems exhibited 
by HTree, my previous contribution to the art of directory indexing. 
This new approach, Shardmap, will not only enhance Tux3 scalability, but 
provides an upgrade path for Ext4 and Lustre as well. Shardmap is also 
likely to be interesting for high performance database design. Best of 
all, Shardmap is considerably simpler than the technology we expect it 
to replace.

The most interesting thing about Shardmap is that it remained 
undiscovered for so long. I expect that you will agree that this is 
particularly impressive, considering how obvious Shardmap is in 
retrospect. I can only speculate that the reason for not seeing this 
obvious solution is that we never asked the right question. The question 
should have been: how do we fix this write multiplication issue? Instead 
we spent ten years asking: what should be do about this cache 
thrashing?. It turns out that an answer to the former is also an answer 
to the latter.

Now let us proceed without further ado to a brief tour of Shardmap, 
starting with the technology we expect it to replace.

The Problem with HTree

Occasionally we see LKML reports of performance issues in HTree at high 
scale, usually from people running scalability benchmarks. Lustre users 
have encountered these issues in real life. I always tended to shy away 
from those discussions because, frankly, I did not see any satisfactory 
answer, other than that HTree works perfectly well at the scale it was 
designed for and at which it is normally used. Recently I did learn the 
right answer: HTree is unfixable, and this is true of any media backed 
B-Tree index. Let me reiterate: contrary to popular opinion, a media 
backed B-Tree is an abysmally poor choice of information structure for 
any randomly updated indexing load.

But how can this be, doesn't everybody use B-Trees in just this way? 
Yes, and everybody is making a big mistake. Let me explain. The big 
issue is write multiplication. Any index that groups entries together in 
blocks will tend to have nearly every block dirty under a random update 
load. How do we transfer all those dirty blocks to cache incrementally, 
efficiently and atomically? We don't, it just cannot be done. In 
practice, we end up writing out most index blocks multiple times due to 
just a few small changes. For example, at the end of a mass update 
create we may find that each block has been written hundreds of times. 
Media transfer latency therefore dominates the operation.

This obvious issue somehow escaped our attention over the entire time 
HTree has been in service. We have occasionally misattributed degraded 
HTree performance to inode table thrashing. To be sure, thrashing at 
high scale is a known problem with Tree, but it is not the biggest 
problem. That would be write multiplication. To fix this, we need to 
step back and adopt a completely different approach.

Dawning of the Light

I am kind of whacking myself on the forehead about this. For an entire 
decade I thought that HTree could be fixed by incremental improvements 
and consequently devoted considerable energy to that effort, the high 
water mark of which was my PHTree post earlier this year:

    http://phunq.net/pipermail/tux3/2013-January/000026.html

The PHTree design is a respectable if uninspired piece of work that 
fixes all the known issues with HTree except for write multiplication, 
which I expected to be pretty easy. Far from it. The issue is 
fundamental to the nature of B-Trees. Though not hitherto recognized in 
the Linux file system community, academics recognized this issue some 
time ago and have been busy hunting for a solution. During one of our 
sushi meetings in the wilds of Mountain View, Kent Overstreet of BCache 
fame pointed me at this work:

    http://www.tokutek.com/2012/12/fractal-tree-indexing-overview/

Such attempts generally fail to get anywhere close to the efficiency 
levels we have become accustomed to with Ext4 and its ilk. But it got me 
thinking along productive lines. (Thank you Kent!) One day the answer 
just hit me like a slow rolling thunderbolt: instead of committing the 
actual B-Tree to disk we should leave it dirty in cache and just log the 
updates to it. This is obviously write-efficient and ACID friendly. It 
is also a poor solution because it sacrifices recovery latency. In the 
event of a crash we need to read the entire log to reconstruct the dirty 
B-Tree, which could take several minutes. During this time, even though 
the raw directory entries are immediately available, the index is 
unavailable. At the scales we are considering, unindexed directory 
access is roughly the same as no access at all.

Birth of Shardmap

Then a faster moving thunderbolt hit me: why not write out the log in 
many small pieces, each covering a part of the hash range? That way, to 
access a single uncached entry we only suffer the latency of loading one 
small piece of the log. With this improvement, the index becomes 
available immediately on restart.

Eureka! Shardmap was born in the form of a secondary hash index on a 
B-Tree that would kick in under heavy load, and be merged back into the 
primary B-Tree when the load eases up. Shardmap would be a "frontend" 
index on the B-Tree, an concept similar to others we have already 
employed in Tux3. This all seemed like a great idea and I immediately 
began to implement a prototype to get an idea of performance.

A few days into my prototype I was hit by a third and final blinding 
flash when I noticed that the primary B-Tree index is not needed at all: 
the in-cache hash tables together with the hash shards logged to media 
constitute a perfectly fine index all on their own. In fact, when I 
investigated further, I found this arrangement to be superior to B-Tree 
approaches in practically every way. (B-Trees still manage to eke out an 
advantage in certain light out of cache loads, which I will not detail 
here.) Shardmap was suddenly elevated from its humble status as 
temporary secondary index to the glorious role of saving all of us from 
HTree's scalability issues. Probably.

Shardmap is a simple and obvious idea, but is that not always the case 
in retrospect? BSD folks came very close to discovering the same thing 
back around the time I was inventing HTree:

    http://en.wikipedia.org/wiki/Dirhash

At the time I recognized the hash table approach as promising, but 
deeply flawed because of its need to load an entire directory just to 
service a single access. Accordingly, HTree continued to be developed 
and went on to rule the world.

I wish I had thought about that harder. All that remained to do to fix 
the BSD Dirhash approach was log the hash table to media efficiently and 
the result would have been Shardmap, many years ago. Well. Better late 
than never.

The remainder of this post takes a deep dive into the proposition that 
Shardmap is a suitable replacement for HTree, potentially suitable for 
use in Tux3, Ext4, and Lustre.

Design

A Shardmap index consists of a scalable number of index shards, starting 
at one for the smallest indexed directories and increasing to 4096 for a 
billion file directory. The maximum size of each shard also increases 
over this range from 64K to four megabytes. Each shard contains index 
entries for some subset of the hash key range. Each shard entry maps a 
hash key to the logical block number of a directory entry block known to 
contain a name that hashes to that key.

Each shard is represented as an unsorted fifo on disk and a small hash 
table in memory. To search for a name, we use some high bits of the hash 
to look up a cached shard hash in memory. If the shard is not hashed in 
cache, then we load the corresponding shard fifo from media and convert 
it to into a hash table. We walk this list of hash collisions, searching 
each referenced directory block for a match on the name.

Clearly, hash collisions must be rare in order to avoid searching 
multiple, potentially out of cache directory entry blocks. Therefore, we 
compute and store our hash keys at a higher precision than our targeted 
scalability range.

As a directory grows, we scale the shardmap in two ways: 1) Rehash a 
cached shard to a larger number of hash buckets and 2) Reshard a stored 
shard fifo to divide it into multiple, smaller shards. These operations 
are staggered to avoid latency spikes. The reshard operation imposes a 
modest degree of write multiplication on the Shardmap design, 
asymptotically approaching a factor of two. This is far better than the 
factor of several hundred we see with HTree.

The key ideas of Shardmap are: 1) the representation of directory data 
is not the same on media as it is in cache. On media we have fifos, but 
in cache we have hash tables. 2) Updating a fifo is cache efficient. 
Only the tail block of the fifo needs to be present in cache. The cache 
footprint of the media image of a shardmap is therefore just one disk 
block per shard. 3) A small fifo on media is easily loaded and converted 
to an efficient hash table shard on demand. Once in cache, index updates 
are performed by updating the cached hash table and appending the same 
entries to the final block of the shard fifo.

To record deletes durably, Shardmap appends negative fifo entries to 
shard fifos. From time to time, a shard fifo containing delete entries 
will be compacted and rewritten in its entirety to prevent unbounded 
growth. This cleanup operation actually turns out to be the most complex 
bit of code in Shardmap, and it is not very complex.

Like any other kind of Tux3 update, Shardmap updates are required to be 
ACID. The Tux3 block forking mechanism makes this easy: each delta 
transition effectively makes all dirty blocks of a directory read-only. 
While transferring a previous delta to disk, directory entries may be 
created in or removed from directory entry blocks on their way to disk, 
so those blocks will be forked. The tail block of a shard fifo may also 
be forked, and so may directory entry free maps. Under a pure create or 
delete load, the additional cache load caused by page forking will 
normally not be much more than the tail blocks of shard fifos. Other 
loads will perform about as you would expect them to.

The cache footprint of an actively updated Shardmap index is necessarily 
the entire index, true not only of Shardmap, but any randomly updated 
index that groups entries together in blocks. If we contemplate running 
a create test on a billion files, we must provide enough cache to 
accommodate the entire index or we will thrash, it is as simple as that. 
For a billion files we will need about eight gigabytes of cache. That is 
actually not too bad, and reflects Shardmap's compact hash table design. 
Less cache than that will force more disk accesses, and the test will 
consequently run slower but correctly.

Shardmap's appetite for shard cache grows to extreme levels as 
directories grow to billions of files. This is not unexpected, however 
it does mean that we need to take some special care with our kernel 
cache design. A shard hashtable in kernel will be an expanding array of 
pages just as it is in userspace, however we will not have the virtual 
memory hardware available to help us out here[2]. On 32 bit hardware we 
will be using highmem for the cache, with attendant inefficient 
kmap/unmap operations and that will suck, but it will still work better 
than HTree for gigantic directories. For smaller directories we can 
adopt some other cache management strategy in order to avoid performance 
regressions.

Comparison to HTree

Like HTree, Shardmap uses tables of fixed size keys that are hashes of 
names in order to rapidly locate some directory block containing 
standard directory entries.

Unlike HTree, Shardmap has one index entry per directory entry. HTree 
uses one index entry per directory entry block, and is unique in that 
regard. This is one of the key design details that has made HTree so 
hard to beat all this time. It also sounds like a big advantage over 
Shardmap in terms of index size, but actually it is a tie because of 
slack space normally found in B-Tree blocks.

Like HTree, a Shardmap index is mapped logically into directory file 
data blocks (logical metadata in Tux3 parlance). This takes advantage of 
the physical/logical layering of classic Unix filesystem design. In 
concrete terms, it reuses the same cache operations for the index as are 
already implemented for ordinary files and avoids complicating the Tux3 
physical layer at which files and the inode table are defined. It also 
provides a degree of modularity that is aesthetically pleasing in itself.

Unlike HTree, a Shardmap index is not interleaved with directory data. 
We place the index data strictly above the directory entry blocks 
because the index is relatively larger than an HTree index, roughly 20% 
of the directory file. At first we intended to place the Shardmap index 
at a very high logical address, but that plan as discarded when we 
noticed that this adds several levels to the page cache radix tree, 
which slows down all directory block accesses significantly (we measured 
6 nanoseconds per radix tree level).

Our refined layout scheme places the Shardmap index a short distance 
above the currently used directory blocks and relocates it higher as the 
directory expands, so we incur about the same radix tree overhead as 
would be required by a simple unindexed directory[1]. This relocation 
does not impose new overhead because we already must relocate the index 
shards as a directory expands, to break them up and limit reload latency.

Unlike HTree, Shardmap needs to keep track of holes in directory entry 
blocks created by deletions. HTree finesses this detail away by creating 
each new entry at a particular place in the btree corresponding to its 
hash; deleted entry records are simply forgotten in the hope that they 
will eventually be compressed away during a block split operation.

Shardmap employs a free record map for this purpose, with one byte per
directory entry block that indicates the largest directory record 
available in the directory block. To avoid churning this map, it is 
updated lazily - the actual largest free record is never larger than the 
size stored in the map, but may be smaller. If so, a failed search for 
free space in the block will update the free map entry to the actual 
largest size. Conversely, on delete the map is updated to be an 
overestimate of the largest record size. The intention of these
heuristics is to reduce the amount of accounting data that needs to be 
written out under sparse update loads.

At scales of billions of entries, even scanning a byte per directory 
entry block is too expensive, so Shardmap goes on to map the free map 
with an additional map layer where each byte gives the size of the 
largest free directory record covered by the associated free map block. 
Three levels of this structure maps 2**(12*3) = 2**36 directory blocks, 
which should be sufficient for the foreseeable future.

Shardmap uses the elegant Siphash general purpose hash function 
obligingly provided by Aumasson and Bernstein:

    https://131002.net/siphash/

Siphash is a wonderful creation that is easily parametrized in terms of 
mixing rounds to the degree required by the application. Shardmap is not 
as demanding in this regard as HTree, which does not implement delete 
coalescing and is therefore sensitive to slight hash distribution 
anomalies. Shardmap is able to tolerate relatively fewer mixing rounds 
in the interest of higher performance.

Siphash as implemented by Shardmap uses 2 rounds per 8 bytes versus 
HTree's default Halfmd4 which uses 24 more complex rounds per 32 bytes, 
a significant performance win for Shardmap:

    http://lxr.free-electrons.com/source/lib/halfmd4.c#L25

https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c#L60

Further, the Siphash dispersal pattern is specifically optimized for 
hash table applications. Our recommendation is that existing HTree users 
add Siphash as the default hash function in the interest of improved 
performance.

Performance

With the Shardmap userspace prototype we were able to obtain some early
performance numbers, which I will describe in general terms with precise 
details to follow. Roughly speaking, both create and delete run at 
around 150 nanoseconds per operation, and do not appear to degrade 
significantly as directory size increases into the hundreds of millions. 
As expected, lookup operations are faster, roughly 100 nanoseconds each. 
This is with a modestly specced workstation and a consumer grade SSD, so 
spinning disk seek performance is not yet being measured.

In general, performance numbers obtained so far match HTree at modest 
scale, definitively dominate at higher scales in the millions of files, 
and continue on up into scales orders of magnitude higher than HTree can 
even attempt.

Directory traversal is in file order for Shardmap, compared to hash 
order for HTree, which entails a computationally intensive procedure of 
sorting and caching intermediate search results required for correct 
directory cookie semantics, a clear win for Shardmap. The effect of 
avoiding HTree's inode table thrashing behaviour has not yet been 
measured (this must wait for a kernel implementation) however it is safe 
to assume that this will be a clear win for Shardmap as well.

Implementation

A nearly complete userspace prototype including unit tests is now 
available in the Tux3 repository:

    https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c

We encourage interested observers to compile and run this standalone 
code in order to verify our performance claims. It is also worth seeing 
for yourself just how simple this code is.

The prototype currently lacks the following two important elements:

    * The reshard operation. So far not tested, so we cannot speak to
      the performance impact of reshard just yet, other than from
      estimates that indicate it is small.

    * Free record mapping. Likewise, this will add some additional,
      modest update overhead that we have not yet measured, only
      estimated.

To be useful, Shardmap needs a kernel implementation, which we plan to 
defer until after merging with mainline. Scalable directories are 
certainly nice, but not essential for evaluating the fitness of Tux3 as 
a filesystem in the context of personal or light server use. I will 
comment further on the design of the kernel implementation in a later post.

Summary

Shardmap is a new directory index design created expressly for the Tux3
filesystem, but holds promise in application areas beyond that as an 
upgrade to existing filesystems and probably also being applicable to 
database design. Shardmap arguably constitutes a breakthrough in 
performance, scalability and simplicity in the arcane field of directory 
indexing, solving a number of well known and notoriously difficult 
performance problems as it does.

We expect a kernel implementation of Shardmap to land some time in the 
next few months. In the mean time, we are now satisfied with this key 
aspect of the Tux3 design and will turn our attention back to the 
handful of remaining issues that need to be addressed before offering 
Tux3 for mainline kernel merge.

Regards,

Daniel

[1] This arguably indicates a flaw in the classic radix tree design, 
possibly correctable to work better with sparse files.

[2] Maybe we should allow kernel modules to own and operate their own 
virtual address tables, that is another story.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-04-15 18:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-20 17:02 Tux3 Report: Meet Shardmap, the designated successor of HTree Daniel Phillips
2013-06-20 19:11 ` Christian Stroetmann
2013-06-20 20:27   ` Daniel Phillips
2015-04-15 18:00     ` Christian Stroetmann
2013-06-24 14:18   ` Pavel Machek
2013-06-24 15:16     ` Christian Stroetmann
2013-06-24 15:39       ` Andreas Karlsson
2013-06-24 17:19         ` Christian Stroetmann
2013-06-24 18:20           ` richard -rw- weinberger
2013-06-25  0:34             ` Christian Stroetmann
2013-06-25  0:15               ` David Lang
  -- strict thread matches above, loose matches on Subject: below --
2013-06-19  2:31 Daniel Phillips
2013-06-19 18:17 ` Christian Stroetmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).