linux-kernel.vger.kernel.org archive mirror
* File system for scratch space (in HPC cluster)
@ 2019-10-24 10:43 Paul Menzel
  2019-10-24 14:55 ` Theodore Y. Ts'o
  2019-10-24 17:51 ` Andreas Dilger
  0 siblings, 2 replies; 7+ messages in thread
From: Paul Menzel @ 2019-10-24 10:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linux Kernel Mailing List, Donald Buczek

Dear Linux folks,


In our cluster, we offer scratch space for temporary files. As
these files are temporary, we do not need any durability
guarantees – especially not across system crashes or shutdowns.
So no `sync`, for example, is needed.

Are there file systems catering to this need? I couldn’t find
any; maybe I missed some options for existing file systems.


Kind regards,

Paul


* Re: File system for scratch space (in HPC cluster)
  2019-10-24 10:43 File system for scratch space (in HPC cluster) Paul Menzel
@ 2019-10-24 14:55 ` Theodore Y. Ts'o
  2019-10-24 15:01   ` Boaz Harrosh
  2019-10-24 17:51 ` Andreas Dilger
  1 sibling, 1 reply; 7+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-24 14:55 UTC (permalink / raw)
  To: Paul Menzel; +Cc: linux-fsdevel, Linux Kernel Mailing List, Donald Buczek

On Thu, Oct 24, 2019 at 12:43:40PM +0200, Paul Menzel wrote:
> 
> In our cluster, we offer scratch space for temporary files. As
> these files are temporary, we do not need any durability
> guarantees – especially not across system crashes or shutdowns.
> So no `sync`, for example, is needed.
> 
> Are there file systems catering to this need? I couldn’t find
> any; maybe I missed some options for existing file systems.

You could use ext4 in nojournal mode.  If you want to make sure that
fsync() doesn't force a cache flush, you can mount with the nobarrier
mount option.
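
As a rough sketch (the device name and mount point are
placeholders, and the device is assumed to have been formatted
without a journal beforehand, e.g. with mkfs.ext4 -O ^has_journal),
a node's startup or job prolog could mount such a filesystem via
mount(2):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* "nobarrier" makes ext4 skip cache-flush barriers, so
	 * fsync() will not force a device cache flush. */
	if (mount("/dev/nvme0n1p1", "/scratch", "ext4", 0,
		  "nobarrier") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}

The same options can of course go into /etc/fstab instead.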

					- Ted

* Re: File system for scratch space (in HPC cluster)
  2019-10-24 14:55 ` Theodore Y. Ts'o
@ 2019-10-24 15:01   ` Boaz Harrosh
  2019-10-24 20:34     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Boaz Harrosh @ 2019-10-24 15:01 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Paul Menzel
  Cc: linux-fsdevel, Linux Kernel Mailing List, Donald Buczek

On 24/10/2019 17:55, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 12:43:40PM +0200, Paul Menzel wrote:
>>
>> In our cluster, we offer scratch space for temporary files. As
>> these files are temporary, we do not need any durability
>> guarantees – especially not across system crashes or shutdowns.
>> So no `sync`, for example, is needed.
>>
>> Are there file systems catering to this need? I couldn’t find
>> any; maybe I missed some options for existing file systems.
> 
> You could use ext4 in nojournal mode.  If you want to make sure that
> fsync() doesn't force a cache flush, you can mount with the nobarrier
> mount option.
> 

And open the file with O_TMPFILE|O_EXCL so there is no metadata either.

I think XFS with O_TMPFILE|O_EXCL does not do any fsync, but I'm
not sure.
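
Something like this rough sketch (the directory is only an
example; note that O_TMPFILE also needs O_RDWR or O_WRONLY):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Create an unnamed file in the filesystem containing
	 * /scratch.  O_EXCL additionally forbids ever linking it
	 * into the namespace with linkat(2). */
	int fd = open("/scratch", O_TMPFILE | O_RDWR | O_EXCL, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ... use fd for temporary data ... */
	close(fd);
	return 0;
}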

> 					- Ted
> 


* Re: File system for scratch space (in HPC cluster)
  2019-10-24 10:43 File system for scratch space (in HPC cluster) Paul Menzel
  2019-10-24 14:55 ` Theodore Y. Ts'o
@ 2019-10-24 17:51 ` Andreas Dilger
  2019-10-25  8:35   ` Paul Menzel
  1 sibling, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2019-10-24 17:51 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Linux FS-devel Mailing List, Linux Kernel Mailing List, Donald Buczek

On Oct 24, 2019, at 4:43 AM, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> 
> Dear Linux folks,
> 
> 
> In our cluster, we offer scratch space for temporary files. As
> these files are temporary, we do not need any durability
> guarantees – especially not across system crashes or shutdowns.
> So no `sync`, for example, is needed.
> 
> Are there file systems catering to this need? I couldn’t find
> any; maybe I missed some options for existing file systems.

How big do you need the scratch filesystem to be?  Is it local
to the node or does it need to be shared between nodes?  If it
needs to be large and shared between nodes then Lustre is typically
used for this.  If it is local and relatively small you could
consider using tmpfs backed by swap on an NVMe flash device
(M.2 or U.2, Optane if you can afford it) inside the node.

That way you get RAM-like performance for many files, with a
larger capacity than RAM when needed (tmpfs can use swap).

You might consider mounting a new tmpfs filesystem per job (no
formatting is needed for tmpfs) and then unmounting it when the
job is done, so that the old files are automatically cleaned up.
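
As a sketch of that idea (the job directory and size are made-up
values), the batch system's prolog could do the equivalent of:

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int main(void)
{
	/* The mount point has to exist first. */
	if (mkdir("/scratch/job42", 0700) != 0) {
		perror("mkdir");
		return 1;
	}
	/* Fresh tmpfs per job; "size=" caps its RAM/swap usage. */
	if (mount("tmpfs", "/scratch/job42", "tmpfs", 0,
		  "size=100g") != 0) {
		perror("mount");
		return 1;
	}
	/* ... job runs ...  The epilog would umount2() the
	 * directory and rmdir it, discarding all files at once. */
	return 0;
}

In practice this is just a one-line "mount -t tmpfs" call in the
prolog script.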

Cheers, Andreas

* Re: File system for scratch space (in HPC cluster)
  2019-10-24 15:01   ` Boaz Harrosh
@ 2019-10-24 20:34     ` Theodore Y. Ts'o
  2019-10-25  8:33       ` Paul Menzel
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-24 20:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Paul Menzel, linux-fsdevel, Linux Kernel Mailing List, Donald Buczek

On Thu, Oct 24, 2019 at 06:01:05PM +0300, Boaz Harrosh wrote:
> > You could use ext4 in nojournal mode.  If you want to make sure that
> > fsync() doesn't force a cache flush, you can mount with the nobarrier
> > mount option.
> 
> And open the file with O_TMPFILE|O_EXCL so there is no metadata as well.

O_TMPFILE means that there is no directory entry created.  The
pathname passed to the open system call is the directory specifying
the file system where the temporary file will be created.

This may or may not be what the original poster wanted, depending on
whether by "scratch file" he meant a file which could be opened by
pathname by another, subsequent process or not.
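
For reference, a file created with O_TMPFILE but without O_EXCL
can later be given a name, after which a subsequent process can
open it by pathname.  A rough sketch (paths are only examples):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	int fd = open("/scratch", O_TMPFILE | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ... write the data ..., then link the unnamed file into
	 * the namespace so another process can find it. */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, path, AT_FDCWD, "/scratch/result",
		   AT_SYMLINK_FOLLOW) != 0) {
		perror("linkat");
		return 1;
	}
	close(fd);
	return 0;
}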

					- Ted

* Re: File system for scratch space (in HPC cluster)
  2019-10-24 20:34     ` Theodore Y. Ts'o
@ 2019-10-25  8:33       ` Paul Menzel
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Menzel @ 2019-10-25  8:33 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Boaz Harrosh
  Cc: linux-fsdevel, Linux Kernel Mailing List, Donald Buczek

Dear Boaz, dear Theodore,


Thank you for your replies.

On 2019-10-24 22:34, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 06:01:05PM +0300, Boaz Harrosh wrote:
>>> You could use ext4 in nojournal mode.  If you want to make sure that
>>> fsync() doesn't force a cache flush, you can mount with the nobarrier
>>> mount option.

Yeah, those are the settings we currently use.

>> And open the file with O_TMPFILE|O_EXCL so there is no metadata either.
> 
> O_TMPFILE means that there is no directory entry created.  The
> pathname passed to the open system call is the directory specifying
> the file system where the temporary file will be created.

Interesting.

The main problem is that we can’t control what the users run on
the cluster, so a mount option is needed.

> This may or may not be what the original poster wanted, depending on
> whether by "scratch file" he meant a file which could be opened by
> pathname by another, subsequent process or not.

Yeah, the scientists often submit scripts where the files are
accessed by subsequent processes.


Kind regards,

Paul


* Re: File system for scratch space (in HPC cluster)
  2019-10-24 17:51 ` Andreas Dilger
@ 2019-10-25  8:35   ` Paul Menzel
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Menzel @ 2019-10-25  8:35 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Linux FS-devel Mailing List, Linux Kernel Mailing List, Donald Buczek

Dear Andreas,


On 2019-10-24 19:51, Andreas Dilger wrote:
> On Oct 24, 2019, at 4:43 AM, Paul Menzel <pmenzel@molgen.mpg.de> 
> wrote:

>> In our cluster, we offer scratch space for temporary files. As
>> these files are temporary, we do not need any durability
>> guarantees – especially not across system crashes or shutdowns.
>> So no `sync`, for example, is needed.
>> 
>> Are there file systems catering to this need? I couldn’t find
>> any; maybe I missed some options for existing file systems.
> 
> How big do you need the scratch filesystem to be?  Is it local to
> the node or does it need to be shared between nodes?

In this case local.

> If it needs to be large and shared between nodes then Lustre is 
> typically used for this.  If it is local and relatively small you 
>> could consider using tmpfs backed by swap on an NVMe flash device
> (M.2 or U.2, Optane if you can afford it) inside the node.
> 
> That way you get RAM-like performance for many files, with a larger 
> capacity than RAM when needed (tmpfs can use swap).
> 
> You might consider mounting a new tmpfs filesystem per job (no
> formatting is needed for tmpfs) and then unmounting it when the
> job is done, so that the old files are automatically cleaned up.

That is a good idea, but probably not practical for 10 TB. Out of
curiosity, what is the limit for “relatively small” in your
experience?


Kind regards,

Paul


end of thread

Thread overview: 7+ messages
2019-10-24 10:43 File system for scratch space (in HPC cluster) Paul Menzel
2019-10-24 14:55 ` Theodore Y. Ts'o
2019-10-24 15:01   ` Boaz Harrosh
2019-10-24 20:34     ` Theodore Y. Ts'o
2019-10-25  8:33       ` Paul Menzel
2019-10-24 17:51 ` Andreas Dilger
2019-10-25  8:35   ` Paul Menzel
