lustre-devel-lustre.org archive mirror
* [lustre-devel] modern precreate
@ 2021-01-08 19:44 Nathan Rutman
  2021-01-09 15:56 ` Andreas Dilger
  0 siblings, 1 reply; 2+ messages in thread
From: Nathan Rutman @ 2021-01-08 19:44 UTC
  To: lustre-devel


Riffing on something Andreas said in a lustre-discuss thread, I'm
hoping someone can correct my understanding of how precreate works
currently.

Olden days:

MDS would ask each OST for a set of precreated objects via an MDT->OST
RPC. Unused precreated objects had to be cleaned up during recovery,
hence a cap on their number. The pool was used up as the MDS assigned
objects to layouts, so the MDS had to go back and get more, even for
0-length files.

Modern days, Lustre 2.5+:

MDT doesn't hold a pool of OST objects but instead takes an OST FID
range from an FLD server. Each MD object is mapped to an eventual OST
object by this FID. The OST side just holds a small number of
anonymous objects and assigns the FID to an object when any operation
is executed without an existing FID->inode mapping on the OST. There
is no more precreate RPC necessary, since OSTs maintain their own pool
of anonymous objects and only use them up when data is actually
written, and can create more when running low. There is no recovery
cleanup needed on the OSTs.
In this case, there should be no performance difference between create
and mknod except for the FLD operation, and the number of OSTs should
not matter for create rates.

Is my understanding wrong? It clearly must be, since Andreas is still
talking about the OST_CREATE RPC and recovery implications, and we do
see a performance difference between mknod and creating files with
layouts.

([lustre-discuss] Improving file create performance with larger create_count)

The max_create_count is between 32 and 20000 (for protocol recovery
reasons, since unused precreated objects are destroyed during
recovery, and we put a cap on how many objects could be destroyed to
avoid badness in case of a bug) so this is already at the maximum.
You should be able to increase the create_count to 20000 as well.
However, this value is "auto tuned" based on how long it takes the OSS
to create the requested objects.  If the OST_CREATE RPC takes too long
then the MDS will ask for fewer objects next time.
> * Is there a theoretical down side to pre-creating more objects?  (MDS or OSS memory usage?  Longer mount times? slower e2fsck?)
> A bit slower e2fsck, but compared to the total filesystem size this is minor.  The biggest issue is that the old precreated objects will be destroyed during MDS-OSS recovery and new ones created.
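
The auto-tuning described in the quoted text can be pictured with a
small model. This is only a sketch, not the Lustre implementation: the
32 and 20000 bounds come from the text above, while the function name,
the halve/double policy, and the 1-second threshold are assumptions
made up for illustration.

```python
# Illustrative model only; not taken from the Lustre source.
MIN_CREATE_COUNT = 32       # lower bound quoted above
MAX_CREATE_COUNT = 20000    # upper bound quoted above


def next_create_count(current, rpc_seconds, slow_threshold=1.0):
    """Return the batch size to request in the next precreate RPC.

    Assumed (hypothetical) policy: halve the batch after a slow RPC,
    double it after a fast one, and clamp to the quoted limits.
    """
    if rpc_seconds > slow_threshold:
        proposed = current // 2
    else:
        proposed = current * 2
    return max(MIN_CREATE_COUNT, min(MAX_CREATE_COUNT, proposed))


count = 1024
for elapsed in (0.2, 0.3, 2.5, 0.1):   # made-up RPC durations in seconds
    count = next_create_count(count, elapsed)
    print(count)
```

Whatever the real policy is, the clamp is the part the quoted text
states: the request size never leaves the 32..20000 range.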

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* Re: [lustre-devel] modern precreate
  2021-01-08 19:44 [lustre-devel] modern precreate Nathan Rutman
@ 2021-01-09 15:56 ` Andreas Dilger
  0 siblings, 0 replies; 2+ messages in thread
From: Andreas Dilger @ 2021-01-09 15:56 UTC
  To: Nathan Rutman; +Cc: lustre-devel


On Jan 8, 2021, at 12:44, Nathan Rutman <nrutman@gmail.com> wrote:


> Riffing on something Andreas said in a lustre-discuss thread, I'm hoping someone can correct my understanding of how precreate works currently.
>
> Olden days:
>
> MDS would ask each OST for a set of precreated objects via a MDT->OST RPC. These have to be cleaned up during recovery, hence a cap. These were used up as MDS assigned them to layouts, and so MDS has to go back and get more, even for 0-length files.
>
> Modern days, Lustre 2.5+:
>
> MDT doesn't hold a pool of OST objects but instead takes an OST fid range from a FLD server instead. Each MD object has a mapping with an eventual OST object by this fid. The OST side just holds a small number of anonymous objects and assigns the fid to an object when any operation is executed without an existing FID->inode mapping on the OST. There is no more precreate RPC necessary, since OSTs maintain their own pool of anonymous objects and only use them up when data is actually written, and can create more when running low. There is no recovery cleanup needed on the OSTs.
> In this case, there should be no performance difference between create and mknod except for the FLD operation, and the number of OSTs should not matter for create rates.
>
> Is my understanding wrong? It clearly must be, since Andreas is still talking OST_CREATE rpc and recovery implications, and we do see a performance difference with mknod and creating files with layouts.

The precreate code still works the same as "the land before the time of FIDs".  Actual objects are still precreated/destroyed on the OSTs. The only difference is that the FID sequences allocated to MDTs allow the OSTs to have different pools of objects for each MDT so that they don't contend/conflict when those MDTs assign the objects to their own inodes.
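
A toy model may make the per-MDT pools concrete. It is a sketch under stated assumptions, not Lustre code: the class and method names and the sequence numbers are hypothetical. It only illustrates that real objects are precreated into a separate pool per FID sequence, handed out to the owning MDT, and that unused ones are destroyed on recovery.

```python
from collections import deque


class OstModel:
    """Toy OST that keeps one pool of precreated objects per FID sequence."""

    def __init__(self):
        self.pools = {}      # sequence -> deque of unused object ids
        self.next_oid = {}   # sequence -> next object id to create

    def precreate(self, seq, count):
        """Create `count` real (empty) objects in the pool for `seq`."""
        pool = self.pools.setdefault(seq, deque())
        start = self.next_oid.get(seq, 1)
        pool.extend(range(start, start + count))
        self.next_oid[seq] = start + count

    def assign(self, seq):
        """Hand the next precreated object to the MDT that owns `seq`."""
        return (seq, self.pools[seq].popleft())

    def recover(self, seq):
        """Destroy the unused precreated objects for `seq`; return how many."""
        destroyed = len(self.pools[seq])
        self.pools[seq].clear()
        return destroyed


ost = OstModel()
SEQ_MDT0, SEQ_MDT1 = 0x100, 0x200   # hypothetical sequence numbers
ost.precreate(SEQ_MDT0, 4)
ost.precreate(SEQ_MDT1, 4)
a = ost.assign(SEQ_MDT0)            # MDT0 consumes from its own pool
b = ost.assign(SEQ_MDT1)            # MDT1 is unaffected by MDT0
destroyed = ost.recover(SEQ_MDT0)   # only MDT0's unused objects go away
print(a, b, destroyed)
```

In this model the two MDTs never touch the same pool, which is the "don't contend/conflict" property described above.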

Having multiple MDTs does "scale" the OST object space, in that there can be more object subdirectories (one per sequence), which improves both the concurrency and the maximum number of objects.  There has also been work done to increase the maximum number of files per directory in ldiskfs, but that doesn't really improve performance.
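
As a rough picture of "one object subdirectory per sequence", here is a sketch of how objects might map to paths. The `O/<seq>/d<N>/<oid>` naming and the 32-way fan-out are assumptions about the on-disk layout, simplified for illustration; the helper names are hypothetical.

```python
SUBDIRS_PER_SEQ = 32   # assumed fan-out under each sequence directory


def object_path(seq, oid):
    """Map (sequence, object id) to a hypothetical on-disk path."""
    return f"O/{seq:x}/d{oid % SUBDIRS_PER_SEQ}/{oid:x}"


def leaf_dirs(sequences):
    """Count the distinct leaf directories available across `sequences`."""
    return len(sequences) * SUBDIRS_PER_SEQ


print(object_path(0x100, 37))
print(leaf_dirs([0x100]), leaf_dirs([0x100, 0x200]))
```

Under these assumptions, each additional sequence adds another set of leaf directories, so creates from different MDTs land in different directories.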

The patch https://review.whamcloud.com/38424 "LU-11912 ofd: reduce LUSTRE_DATA_SEQ_MAX_WIDTH" (https://jira.whamcloud.com/browse/LU-11912) would create smaller object directory trees, and allow "aging" of old objects to be in separate object directory trees from new objects.  That allows old objects to drop out of cache (avoiding one-create-per-leaf as the size of the directory grows very large), and keeps fewer "hot" objects densely packed in memory (allowing many new entries to be packed into a single leaf block).
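
The effect of a smaller sequence width can be sketched with a few lines of arithmetic. The widths below are made-up numbers, not the values used by the patch, and the assumption that sequences are handed out consecutively is only for illustration.

```python
def allocate(n_objects, seq_width, first_seq=0x100):
    """Return {sequence: object_count} after creating n_objects in order.

    A new sequence (and so a new directory tree) starts whenever the
    current one has used `seq_width` object ids.
    """
    trees = {}
    for i in range(n_objects):
        seq = first_seq + i // seq_width
        trees[seq] = trees.get(seq, 0) + 1
    return trees


wide = allocate(10_000, seq_width=1_000_000)   # hypothetical large width
narrow = allocate(10_000, seq_width=1_000)     # hypothetical reduced width
print(len(wide), max(wide.values()))
print(len(narrow), max(narrow.values()))
```

With the narrow width, the same objects are spread over several smaller trees, and only the newest tree receives creates; the older ones are the "aged" trees that can drop out of cache.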

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud







[-- Attachment #1.2: Type: text/html, Size: 7271 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 2+ messages in thread
