Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/9p.rst | 56
-rw-r--r-- Documentation/filesystems/affs.rst | 16
-rw-r--r-- Documentation/filesystems/afs.rst | 14
-rw-r--r-- Documentation/filesystems/api-summary.rst | 15
-rw-r--r-- Documentation/filesystems/autofs-mount-control.rst | 8
-rw-r--r-- Documentation/filesystems/autofs.rst | 4
-rw-r--r-- Documentation/filesystems/automount-support.rst (renamed from Documentation/filesystems/automount-support.txt) | 23
-rw-r--r-- Documentation/filesystems/befs.rst | 4
-rw-r--r-- Documentation/filesystems/btrfs.rst | 17
-rw-r--r-- Documentation/filesystems/caching/backend-api.rst | 479
-rw-r--r-- Documentation/filesystems/caching/backend-api.txt | 726
-rw-r--r-- Documentation/filesystems/caching/cachefiles.rst (renamed from Documentation/filesystems/caching/cachefiles.txt) | 321
-rw-r--r-- Documentation/filesystems/caching/fscache.rst | 348
-rw-r--r-- Documentation/filesystems/caching/fscache.txt | 448
-rw-r--r-- Documentation/filesystems/caching/index.rst | 12
-rw-r--r-- Documentation/filesystems/caching/netfs-api.rst | 452
-rw-r--r-- Documentation/filesystems/caching/netfs-api.txt | 910
-rw-r--r-- Documentation/filesystems/caching/object.txt | 320
-rw-r--r-- Documentation/filesystems/caching/operations.txt | 213
-rw-r--r-- Documentation/filesystems/ceph.rst | 42
-rw-r--r-- Documentation/filesystems/coda.rst | 1670
-rw-r--r-- Documentation/filesystems/coda.txt | 1676
-rw-r--r-- Documentation/filesystems/configfs.rst (renamed from Documentation/filesystems/configfs/configfs.txt) | 175
-rw-r--r-- Documentation/filesystems/dax.rst | 307
-rw-r--r-- Documentation/filesystems/dax.txt | 132
-rw-r--r-- Documentation/filesystems/debugfs.rst | 31
-rw-r--r-- Documentation/filesystems/devpts.rst | 36
-rw-r--r-- Documentation/filesystems/devpts.txt | 26
-rw-r--r-- Documentation/filesystems/directory-locking.rst | 343
-rw-r--r-- Documentation/filesystems/dlmfs.rst | 2
-rw-r--r-- Documentation/filesystems/dnotify.rst (renamed from Documentation/filesystems/dnotify.txt) | 13
-rw-r--r-- Documentation/filesystems/efivarfs.rst | 17
-rw-r--r-- Documentation/filesystems/erofs.rst | 326
-rw-r--r-- Documentation/filesystems/ext2.rst | 5
-rw-r--r-- Documentation/filesystems/ext4/about.rst | 2
-rw-r--r-- Documentation/filesystems/ext4/attributes.rst | 68
-rw-r--r-- Documentation/filesystems/ext4/bigalloc.rst | 2
-rw-r--r-- Documentation/filesystems/ext4/bitmaps.rst | 6
-rw-r--r-- Documentation/filesystems/ext4/blockgroup.rst | 38
-rw-r--r-- Documentation/filesystems/ext4/blockmap.rst | 2
-rw-r--r-- Documentation/filesystems/ext4/blocks.rst | 2
-rw-r--r-- Documentation/filesystems/ext4/checksums.rst | 26
-rw-r--r-- Documentation/filesystems/ext4/directory.rst | 187
-rw-r--r-- Documentation/filesystems/ext4/eainode.rst | 10
-rw-r--r-- Documentation/filesystems/ext4/globals.rst | 1
-rw-r--r-- Documentation/filesystems/ext4/group_descr.rst | 126
-rw-r--r-- Documentation/filesystems/ext4/ifork.rst | 60
-rw-r--r-- Documentation/filesystems/ext4/inlinedata.rst | 8
-rw-r--r-- Documentation/filesystems/ext4/inodes.rst | 316
-rw-r--r-- Documentation/filesystems/ext4/journal.rst | 374
-rw-r--r-- Documentation/filesystems/ext4/mmp.rst | 36
-rw-r--r-- Documentation/filesystems/ext4/orphan.rst | 42
-rw-r--r-- Documentation/filesystems/ext4/overview.rst | 2
-rw-r--r-- Documentation/filesystems/ext4/special_inodes.rst | 19
-rw-r--r-- Documentation/filesystems/ext4/super.rst | 566
-rw-r--r-- Documentation/filesystems/ext4/verity.rst | 3
-rw-r--r-- Documentation/filesystems/f2fs.rst | 622
-rw-r--r-- Documentation/filesystems/fiemap.rst (renamed from Documentation/filesystems/fiemap.txt) | 143
-rw-r--r-- Documentation/filesystems/files.rst (renamed from Documentation/filesystems/files.txt) | 66
-rw-r--r-- Documentation/filesystems/fscrypt.rst | 522
-rw-r--r-- Documentation/filesystems/fsverity.rst | 470
-rw-r--r-- Documentation/filesystems/fuse-io.rst (renamed from Documentation/filesystems/fuse-io.txt) | 9
-rw-r--r-- Documentation/filesystems/fuse.rst | 31
-rw-r--r-- Documentation/filesystems/gfs2-glocks.rst (renamed from Documentation/filesystems/gfs2-glocks.txt) | 154
-rw-r--r-- Documentation/filesystems/gfs2.rst | 37
-rw-r--r-- Documentation/filesystems/hfs.rst | 2
-rw-r--r-- Documentation/filesystems/hpfs.rst | 2
-rw-r--r-- Documentation/filesystems/idmappings.rst | 1039
-rw-r--r-- Documentation/filesystems/index.rst | 27
-rw-r--r-- Documentation/filesystems/journalling.rst | 97
-rw-r--r-- Documentation/filesystems/locking.rst | 313
-rw-r--r-- Documentation/filesystems/locks.rst (renamed from Documentation/filesystems/locks.txt) | 29
-rw-r--r-- Documentation/filesystems/mandatory-locking.txt | 181
-rw-r--r-- Documentation/filesystems/mount_api.rst (renamed from Documentation/filesystems/mount_api.txt) | 341
-rw-r--r-- Documentation/filesystems/netfs_library.rst | 595
-rw-r--r-- Documentation/filesystems/nfs/client-identifier.rst | 216
-rw-r--r-- Documentation/filesystems/nfs/exporting.rst | 94
-rw-r--r-- Documentation/filesystems/nfs/index.rst | 3
-rw-r--r-- Documentation/filesystems/nfs/reexport.rst | 113
-rw-r--r-- Documentation/filesystems/nfs/rpc-cache.rst | 2
-rw-r--r-- Documentation/filesystems/nfs/rpc-server-gss.rst | 11
-rw-r--r-- Documentation/filesystems/nilfs2.rst | 2
-rw-r--r-- Documentation/filesystems/ntfs3.rst | 123
-rw-r--r-- Documentation/filesystems/ocfs2.rst | 2
-rw-r--r-- Documentation/filesystems/omfs.rst | 2
-rw-r--r-- Documentation/filesystems/orangefs.rst | 6
-rw-r--r-- Documentation/filesystems/overlayfs.rst | 363
-rw-r--r-- Documentation/filesystems/path-lookup.rst | 230
-rw-r--r-- Documentation/filesystems/path-lookup.txt | 2
-rw-r--r-- Documentation/filesystems/porting.rst | 312
-rw-r--r-- Documentation/filesystems/proc.rst | 666
-rw-r--r-- Documentation/filesystems/qnx6.rst | 4
-rw-r--r-- Documentation/filesystems/quota.rst (renamed from Documentation/filesystems/quota.txt) | 53
-rw-r--r-- Documentation/filesystems/ramfs-rootfs-initramfs.rst | 15
-rw-r--r-- Documentation/filesystems/seq_file.rst (renamed from Documentation/filesystems/seq_file.txt) | 91
-rw-r--r-- Documentation/filesystems/sharedsubtree.rst (renamed from Documentation/filesystems/sharedsubtree.txt) | 402
-rw-r--r-- Documentation/filesystems/smb/cifsroot.rst (renamed from Documentation/filesystems/cifs/cifsroot.txt) | 58
-rw-r--r-- Documentation/filesystems/smb/index.rst | 10
-rw-r--r-- Documentation/filesystems/smb/ksmbd.rst | 186
-rw-r--r-- Documentation/filesystems/spufs/index.rst | 13
-rw-r--r-- Documentation/filesystems/spufs/spu_create.rst | 131
-rw-r--r-- Documentation/filesystems/spufs/spu_run.rst | 138
-rw-r--r-- Documentation/filesystems/spufs/spufs.rst (renamed from Documentation/filesystems/spufs.txt) | 306
-rw-r--r-- Documentation/filesystems/squashfs.rst | 60
-rw-r--r-- Documentation/filesystems/sysfs-pci.txt | 131
-rw-r--r-- Documentation/filesystems/sysfs-tagging.txt | 42
-rw-r--r-- Documentation/filesystems/sysfs.rst | 60
-rw-r--r-- Documentation/filesystems/tmpfs.rst | 121
-rw-r--r-- Documentation/filesystems/ubifs-authentication.rst | 12
-rw-r--r-- Documentation/filesystems/ubifs.rst | 2
-rw-r--r-- Documentation/filesystems/udf.rst | 2
-rw-r--r-- Documentation/filesystems/vfat.rst | 4
-rw-r--r-- Documentation/filesystems/vfs.rst | 428
-rw-r--r-- Documentation/filesystems/virtiofs.rst | 14
-rw-r--r-- Documentation/filesystems/xfs/index.rst | 14
-rw-r--r-- Documentation/filesystems/xfs/xfs-delayed-logging-design.rst (renamed from Documentation/filesystems/xfs-delayed-logging-design.txt) | 430
-rw-r--r-- Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst | 194
-rw-r--r-- Documentation/filesystems/xfs/xfs-online-fsck-design.rst | 5315
-rw-r--r-- Documentation/filesystems/xfs/xfs-self-describing-metadata.rst (renamed from Documentation/filesystems/xfs-self-describing-metadata.txt) | 201
-rw-r--r-- Documentation/filesystems/zonefs.rst | 89
120 files changed, 17883 insertions(+), 8553 deletions(-)
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 671fef39a802..1e0e0bb6fdf9 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -18,7 +18,7 @@ and Maya Gokhale. Additional development by Greg Watson
The best detailed explanation of the Linux implementation and applications of
the 9p client is available in the form of a USENIX paper:
- http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
+ https://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
Other applications are described in the following papers:
@@ -78,19 +78,39 @@ Options
offering several exported file systems.
cache=mode specifies a caching policy. By default, no caches are used.
-
- none
- default no cache policy, metadata and data
- alike are synchronous.
- loose
- no attempts are made at consistency,
- intended for exclusive, read-only mounts
- fscache
- use FS-Cache for a persistent, read-only
- cache backend.
- mmap
- minimal cache that is only used for read-write
- mmap. Northing else is cached, like cache=none
+ The mode can be specified as a bitmask or by using one of the
+ preexisting common 'shortcuts'.
+ The bitmask is described below: (unspecified bits are reserved)
+
+ ========== ====================================================
+ 0b00000000 all caches disabled, mmap disabled
+ 0b00000001 file caches enabled
+ 0b00000010 meta-data caches enabled
+ 0b00000100 writeback behavior (as opposed to writethrough)
+ 0b00001000 loose caches (no explicit consistency with server)
+ 0b10000000 fscache enabled for persistent caching
+ ========== ====================================================
+
+ The current shortcuts and their associated bitmask are:
+
+ ========= ====================================================
+ none 0b00000000 (no caching)
+ readahead 0b00000001 (only read-ahead file caching)
+ mmap 0b00000101 (read-ahead + writeback file cache)
+ loose 0b00001111 (non-coherent file and meta-data caches)
+ fscache 0b10001111 (persistent loose cache)
+ ========= ====================================================
+
+ NOTE: only these shortcuts are tested modes of operation at the
+ moment, so using other combinations of bit-patterns is not
+ known to work. Work on better cache support is in progress.
+
+ IMPORTANT: loose caches (and by extension at the moment fscache)
+ do not necessarily validate cached values on the server. In other
+ words changes on the server are not guaranteed to be reflected
+ on the client system. Only use this mode of operation if you
+ have an exclusive mount and the server will not modify the filesystem
+ underneath you.
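+
+ For example, to mount with the loose caching policy (a usage
+ sketch; the server address and mount point are placeholders)::
+
+   # mount -t 9p -o trans=tcp,cache=loose 10.0.0.2 /mnt/9p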
debug=n specifies debug level. The debug level is a bitmask.
@@ -137,6 +157,12 @@ Options
This can be used to share devices/named pipes/sockets between
hosts. This functionality will be expanded in later versions.
+ directio bypass page cache on all read/write operations
+
+ ignoreqv ignore qid.version==0 as a marker to ignore cache
+
+ noxattr do not offer xattr functions on this mount.
+
access there are four access modes.
user
if a user tries to access a file on v9fs
@@ -192,4 +218,4 @@ For more information on the Plan 9 Operating System check out
http://plan9.bell-labs.com/plan9
For information on Plan 9 from User Space (Plan 9 applications and libraries
-ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
+ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/
diff --git a/Documentation/filesystems/affs.rst b/Documentation/filesystems/affs.rst
index 7f1a40dce6d3..5776cbd5fa53 100644
--- a/Documentation/filesystems/affs.rst
+++ b/Documentation/filesystems/affs.rst
@@ -110,13 +110,15 @@ The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
- R maps to r for user, group and others. On directories, R implies x.
- - If both W and D are allowed, w will be set.
+ - W maps to w.
- E maps to x.
- - H and P are always retained and ignored under Linux.
+ - D is ignored.
- - A is always reset when a file is written to.
+ - H, S and P are always retained and ignored under Linux.
+
+ - A is cleared when a file is written to.
User id and group id will be used unless set[gu]id are given as mount
options. Since most of the Amiga file systems are single user systems
@@ -128,11 +130,13 @@ Linux -> Amiga:
The Linux rwxrwxrwx file mode is handled as follows:
- - r permission will set R for user, group and others.
+ - r permission will allow R for user, group and others.
+
+ - w permission will allow W for user, group and others.
- - w permission will set W and D for user, group and others.
+ - x permission of the user will allow E for plain files.
- - x permission of the user will set E for plain files.
+ - D will be allowed for user, group and others.
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst
index c4ec39a5966e..f15ba388bbde 100644
--- a/Documentation/filesystems/afs.rst
+++ b/Documentation/filesystems/afs.rst
@@ -44,7 +44,7 @@ options::
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
- CONFIG_AFS - The AFS filesystem
+ CONFIG_AFS_FS - The AFS filesystem
Additionally, the following can be turned on to aid debugging::
@@ -70,7 +70,7 @@ list of volume location server IP addresses::
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
- Documentation/networking/rxrpc.txt
+ Documentation/networking/rxrpc.rst
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
@@ -109,7 +109,7 @@ Mountpoints
AFS has a concept of mountpoints. In AFS terms, these are specially formatted
symbolic links (of the same form as the "device name" passed to mount). kAFS
presents these to the user as directories that have a follow-link capability
-(ie: symbolic link semantics). If anyone attempts to access them, they will
+(i.e.: symbolic link semantics). If anyone attempts to access them, they will
automatically cause the target volume to be mounted (if possible) on that site.
Automatically mounted filesystems will be automatically unmounted approximately
@@ -144,7 +144,7 @@ looks up a cell of the same name, for example::
Proc Filesystem
===============
-The AFS modules creates a "/proc/fs/afs/" directory and populates it:
+The AFS module creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module and
their usage counts::
@@ -190,7 +190,7 @@ Security
Secure operations are initiated by acquiring a key using the klog program. A
very primitive klog program is available at:
- http://people.redhat.com/~dhowells/rxrpc/klog.c
+ https://people.redhat.com/~dhowells/rxrpc/klog.c
This should be compiled by::
@@ -201,7 +201,7 @@ And then run as::
./klog
Assuming it's successful, this adds a key of type RxRPC, named for the service
-and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
+and cell, e.g.: "afs@<cellname>". This can be viewed with the keyctl program or
by cat'ing /proc/keys::
[root@andromeda ~]# keyctl show
@@ -211,7 +211,7 @@ by cat'ing /proc/keys::
111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM
Currently the username, realm, password and proposed ticket lifetime are
-compiled in to the program.
+compiled into the program.
It is not required to acquire a key before using AFS facilities, but if one is
not acquired then all operations will be governed by the anonymous user parts
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index bbb0c1c0e5cf..98db2ea5fa12 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -71,9 +71,6 @@ Other Functions
.. kernel-doc:: fs/fs-writeback.c
:export:
-.. kernel-doc:: fs/block_dev.c
- :export:
-
.. kernel-doc:: fs/anon_inodes.c
:export:
@@ -86,9 +83,6 @@ Other Functions
.. kernel-doc:: fs/dax.c
:export:
-.. kernel-doc:: fs/direct-io.c
- :export:
-
.. kernel-doc:: fs/libfs.c
:export:
@@ -104,6 +98,9 @@ Other Functions
.. kernel-doc:: fs/xattr.c
:export:
+.. kernel-doc:: fs/namespace.c
+ :export:
+
The proc filesystem
===================
@@ -125,6 +122,12 @@ Events based on file descriptors
.. kernel-doc:: fs/eventfd.c
:export:
+eventpoll (epoll) interfaces
+============================
+
+.. kernel-doc:: fs/eventpoll.c
+ :internal:
+
The Filesystem for Exporting Kernel Objects
===========================================
diff --git a/Documentation/filesystems/autofs-mount-control.rst b/Documentation/filesystems/autofs-mount-control.rst
index 2903aed92316..b5a379d25c40 100644
--- a/Documentation/filesystems/autofs-mount-control.rst
+++ b/Documentation/filesystems/autofs-mount-control.rst
@@ -196,7 +196,7 @@ information and return operation results::
struct args_ismountpoint ismountpoint;
};
- char path[0];
+ char path[];
};
The ioctlfd field is a mount point file descriptor of an autofs mount
@@ -391,7 +391,7 @@ variation uses the path and optionally in.type field of struct args_ismountpoint
set to an autofs mount type. The call returns 1 if this is a mount point
and sets out.devid field to the device number of the mount and out.magic
field to the relevant super block magic number (described below) or 0 if
-it isn't a mountpoint. In both cases the the device number (as returned
+it isn't a mountpoint. In both cases the device number (as returned
by new_encode_dev()) is returned in out.devid field.
If supplied with a file descriptor we're looking for a specific mount,
@@ -399,12 +399,12 @@ not necessarily at the top of the mounted stack. In this case the path
the descriptor corresponds to is considered a mountpoint if it is itself
a mountpoint or contains a mount, such as a multi-mount without a root
mount. In this case we return 1 if the descriptor corresponds to a mount
-point and and also returns the super magic of the covering mount if there
+point and also returns the super magic of the covering mount if there
is one or 0 if it isn't a mountpoint.
If a path is supplied (and the ioctlfd field is set to -1) then the path
is looked up and is checked to see if it is the root of a mount. If a
type is also given we are looking for a particular autofs mount and if
-a match isn't found a fail is returned. If the the located path is the
+a match isn't found a fail is returned. If the located path is the
root of a mount 1 is returned along with the super magic of the mount
or 0 otherwise.
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 681c6a492bc0..3b6e38e646cd 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -35,7 +35,7 @@ This document describes only the kernel module and the interactions
required with any user-space program. Subsequent text refers to this
as the "automount daemon" or simply "the daemon".
-"autofs" is a Linux kernel module with provides the "autofs"
+"autofs" is a Linux kernel module which provides the "autofs"
filesystem type. Several "autofs" filesystems can be mounted and they
can each be managed separately, or all managed by the same daemon.
@@ -467,7 +467,7 @@ Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct args_ismountpoint ismountpoint;
};
- char path[0];
+ char path[];
};
For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.rst
index 7d9f82607562..430f0b40796b 100644
--- a/Documentation/filesystems/automount-support.txt
+++ b/Documentation/filesystems/automount-support.rst
@@ -1,3 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Automount Support
+=================
+
+
Support is available for filesystems that wish to do automounting
support (such as kAFS which can be found in fs/afs/ and NFS in
fs/nfs/). This facility includes allowing in-kernel mounts to be
@@ -5,13 +12,12 @@ performed and mountpoint degradation to be requested. The latter can
also be requested by userspace.
-======================
-IN-KERNEL AUTOMOUNTING
+In-Kernel Automounting
======================
See section "Mount Traps" of Documentation/filesystems/autofs.rst
-Then from userspace, you can just do something like:
+Then from userspace, you can just do something like::
[root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs
@@ -21,7 +27,7 @@ Then from userspace, you can just do something like:
[root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2
-And then if you look in the mountpoint catalogue, you'll see something like:
+And then if you look in the mountpoint catalogue, you'll see something like::
[root@andromeda root]# cat /proc/mounts
...
@@ -30,8 +36,7 @@ And then if you look in the mountpoint catalogue, you'll see something like:
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
-===========================
-AUTOMATIC MOUNTPOINT EXPIRY
+Automatic Mountpoint Expiry
===========================
Automatic expiration of mountpoints is easy, provided you've mounted the
@@ -43,7 +48,8 @@ To do expiration, you need to follow these steps:
hung.
(2) When a new mountpoint is created in the ->d_automount method, add
- the mnt to the list using mnt_set_expiry()
+ the mnt to the list using mnt_set_expiry()::
+
mnt_set_expiry(newmnt, &afs_vfsmounts);
(3) When you want mountpoints to be expired, call mark_mounts_for_expiry()
@@ -70,8 +76,7 @@ and the copies of those that are on an expiration list will be added to the
same expiration list.
-=======================
-USERSPACE DRIVEN EXPIRY
+Userspace Driven Expiry
=======================
As an alternative, it is possible for userspace to request expiry of any
diff --git a/Documentation/filesystems/befs.rst b/Documentation/filesystems/befs.rst
index 79f9740d76ff..a22f603b2938 100644
--- a/Documentation/filesystems/befs.rst
+++ b/Documentation/filesystems/befs.rst
@@ -106,8 +106,8 @@ iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
============= ===========================================================
-How to Get Lastest Version
-==========================
+How to Get Latest Version
+=========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
diff --git a/Documentation/filesystems/btrfs.rst b/Documentation/filesystems/btrfs.rst
index d0904f602819..a81db8f54d68 100644
--- a/Documentation/filesystems/btrfs.rst
+++ b/Documentation/filesystems/btrfs.rst
@@ -19,15 +19,24 @@ The main Btrfs features include:
* Subvolumes (separate internal filesystem roots)
* Object level mirroring and striping
* Checksums on data and metadata (multiple algorithms available)
- * Compression
+ * Compression (multiple algorithms available)
+ * Reflink, deduplication
+ * Scrub (on-line checksum verification)
+ * Hierarchical quota groups (subvolume and snapshot support)
* Integrated multiple device support, with several raid algorithms
* Offline filesystem check
- * Efficient incremental backup and FS mirroring
+ * Efficient incremental backup and FS mirroring (send/receive)
+ * Trim/discard
* Online filesystem defragmentation
+ * Swapfile support
+ * Zoned mode
+ * Read/write metadata verification
+ * Online resize (shrink, grow)
-For more information please refer to the wiki
+For more information please refer to the documentation site or wiki
+
+ https://btrfs.readthedocs.io
- https://btrfs.wiki.kernel.org
that maintains information about administration tasks, frequently asked
questions, use cases, mount options, comprehensible changelogs, features,
diff --git a/Documentation/filesystems/caching/backend-api.rst b/Documentation/filesystems/caching/backend-api.rst
new file mode 100644
index 000000000000..3a199fc50828
--- /dev/null
+++ b/Documentation/filesystems/caching/backend-api.rst
@@ -0,0 +1,479 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Cache Backend API
+=================
+
+The FS-Cache system provides an API by which actual caches can be supplied to
+FS-Cache for it to then serve out to network filesystems and other interested
+parties. This API is used by::
+
+ #include <linux/fscache-cache.h>
+
+
+Overview
+========
+
+Interaction with the API is handled on three levels: cache, volume and data
+storage, and each level has its own type of cookie object:
+
+ ======================= =======================
+ COOKIE C TYPE
+ ======================= =======================
+ Cache cookie struct fscache_cache
+ Volume cookie struct fscache_volume
+ Data storage cookie struct fscache_cookie
+ ======================= =======================
+
+Cookies are used to provide some filesystem data to the cache, manage state and
+pin the cache during access in addition to acting as reference points for the
+API functions. Each cookie has a debugging ID that is included in trace points
+to make it easier to correlate traces. Note, though, that debugging IDs are
+simply allocated from incrementing counters and will eventually wrap.
+
+The cache backend and the network filesystem can both ask for cache cookies -
+and if they ask for one of the same name, they'll get the same cookie. Volume
+and data cookies, however, are created at the behest of the filesystem only.
+
+
+Cache Cookies
+=============
+
+Caches are represented in the API by cache cookies. These are objects of
+type::
+
+ struct fscache_cache {
+ void *cache_priv;
+ unsigned int debug_id;
+ char *name;
+ ...
+ };
+
+There are a few fields that the cache backend might be interested in. The
+``debug_id`` can be used in tracing to match lines referring to the same cache
+and ``name`` is the name the cache was registered with. The ``cache_priv``
+member is private data provided by the cache when it is brought online. The
+other fields are for internal use.
+
+
+Registering a Cache
+===================
+
+When a cache backend wants to bring a cache online, it should first register
+the cache name and that will get it a cache cookie. This is done with::
+
+ struct fscache_cache *fscache_acquire_cache(const char *name);
+
+This will look up and potentially create a cache cookie. The cache cookie may
+have already been created by a network filesystem looking for it, in which case
+that cache cookie will be used. If the cache cookie is not in use by another
+cache, it will be moved into the preparing state, otherwise it will return
+busy.
+
+If successful, the cache backend can then start setting up the cache. In the
+event that the initialisation fails, the cache backend should call::
+
+ void fscache_relinquish_cache(struct fscache_cache *cache);
+
+to reset and discard the cookie.
+
+
+Bringing a Cache Online
+=======================
+
+Once the cache is set up, it can be brought online by calling::
+
+ int fscache_add_cache(struct fscache_cache *cache,
+ const struct fscache_cache_ops *ops,
+ void *cache_priv);
+
+This stores the cache operations table pointer and cache private data into the
+cache cookie and moves the cache to the active state, thereby allowing accesses
+to take place.
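+
+As an illustration, a minimal bring-up sequence for a hypothetical backend
+might look like the following (``mycache_setup()`` and ``mycache_ops`` are
+placeholders for backend-specific code, and failure of
+fscache_acquire_cache() is assumed to be reported as an ERR_PTR)::
+
+ struct fscache_cache *cache;
+
+ cache = fscache_acquire_cache("mycache");
+ if (IS_ERR(cache))
+         return PTR_ERR(cache);
+
+ if (mycache_setup(cache) < 0) {
+         /* Setup failed: reset and discard the cache cookie. */
+         fscache_relinquish_cache(cache);
+         return -EIO;
+ }
+
+ /* Go live: install the ops table and private data. */
+ return fscache_add_cache(cache, &mycache_ops, mycache_private);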
+
+
+Withdrawing a Cache From Service
+================================
+
+The cache backend can withdraw a cache from service by calling this function::
+
+ void fscache_withdraw_cache(struct fscache_cache *cache);
+
+This moves the cache to the withdrawn state to prevent new cache- and
+volume-level accesses from starting and then waits for outstanding cache-level
+accesses to complete.
+
+The cache must then go through the data storage objects it has and tell fscache
+to withdraw them, calling::
+
+ void fscache_withdraw_cookie(struct fscache_cookie *cookie);
+
+on the cookie that each object belongs to. This schedules the specified cookie
+for withdrawal. This gets offloaded to a workqueue. The cache backend can
+wait for completion by calling::
+
+ void fscache_wait_for_objects(struct fscache_cache *cache);
+
+Once all the cookies are withdrawn, a cache backend can withdraw all the
+volumes, calling::
+
+ void fscache_withdraw_volume(struct fscache_volume *volume);
+
+to tell fscache that a volume has been withdrawn. This waits for all
+outstanding accesses on the volume to complete before returning.
+
+When the cache is completely withdrawn, fscache should be notified by
+calling::
+
+ void fscache_relinquish_cache(struct fscache_cache *cache);
+
+to clear fields in the cookie and discard the caller's ref on it.
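+
+Putting these steps together, a backend's teardown might be sketched as
+follows (the object and volume lists here are hypothetical backend-side
+bookkeeping)::
+
+ fscache_withdraw_cache(cache);
+
+ /* Ask fscache to withdraw every data storage object we back. */
+ list_for_each_entry(obj, &mycache_objects, cache_link)
+         fscache_withdraw_cookie(obj->cookie);
+ fscache_wait_for_objects(cache);
+
+ /* Then take the volumes out of service... */
+ list_for_each_entry(vol, &mycache_volumes, cache_link)
+         fscache_withdraw_volume(vol->vcookie);
+
+ /* ...and finally drop our ref on the cache cookie. */
+ fscache_relinquish_cache(cache);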
+
+
+Volume Cookies
+==============
+
+Within a cache, the data storage objects are organised into logical volumes.
+These are represented in the API as objects of type::
+
+ struct fscache_volume {
+ struct fscache_cache *cache;
+ void *cache_priv;
+ unsigned int debug_id;
+ char *key;
+ unsigned int key_hash;
+ ...
+ u8 coherency_len;
+ u8 coherency[];
+ };
+
+There are a number of fields here that are of interest to the caching backend:
+
+ * ``cache`` - The parent cache cookie.
+
+ * ``cache_priv`` - A place for the cache to stash private data.
+
+ * ``debug_id`` - A debugging ID for logging in tracepoints.
+
+ * ``key`` - A printable string with no '/' characters in it that represents
+ the index key for the volume. The key is NUL-terminated and padded out to
+ a multiple of 4 bytes.
+
+ * ``key_hash`` - A hash of the index key. This should work out the same, no
+ matter the cpu arch and endianness.
+
+ * ``coherency`` - A piece of coherency data that should be checked when the
+ volume is bound to in the cache.
+
+ * ``coherency_len`` - The amount of data in the coherency buffer.
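+
+For instance, a backend binding to previously stored volume data might
+compare the saved blob against the cookie's coherency data and discard
+stale contents on mismatch (``rec`` and the discard helper are
+hypothetical)::
+
+ if (rec->coherency_len != volume->coherency_len ||
+     memcmp(rec->coherency, volume->coherency,
+            volume->coherency_len) != 0)
+         mycache_discard_volume(volume);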
+
+
+Data Storage Cookies
+====================
+
+A volume is a logical group of data storage objects, each of which is
+represented to the network filesystem by a cookie. Cookies are represented in
+the API as objects of type::
+
+ struct fscache_cookie {
+ struct fscache_volume *volume;
+ void *cache_priv;
+ unsigned long flags;
+ unsigned int debug_id;
+ unsigned int inval_counter;
+ loff_t object_size;
+ u8 advice;
+ u32 key_hash;
+ u8 key_len;
+ u8 aux_len;
+ ...
+ };
+
+The fields in the cookie that are of interest to the cache backend are:
+
+ * ``volume`` - The parent volume cookie.
+
+ * ``cache_priv`` - A place for the cache to stash private data.
+
+ * ``flags`` - A collection of bit flags, including:
+
+ * FSCACHE_COOKIE_NO_DATA_TO_READ - There is no data available in the
+ cache to be read as the cookie has been created or invalidated.
+
+ * FSCACHE_COOKIE_NEEDS_UPDATE - The coherency data and/or object size has
+ been changed and needs committing.
+
+ * FSCACHE_COOKIE_LOCAL_WRITE - The netfs's data has been modified
+ locally, so the cache object may be in an incoherent state with respect
+ to the server.
+
+ * FSCACHE_COOKIE_HAVE_DATA - The backend should set this if it
+ successfully stores data into the cache.
+
+ * FSCACHE_COOKIE_RETIRED - The cookie was invalidated when it was
+ relinquished and the cached data should be discarded.
+
+ * ``debug_id`` - A debugging ID for logging in tracepoints.
+
+ * ``inval_counter`` - The number of invalidations done on the cookie.
+
+ * ``advice`` - Information about how the cookie is to be used.
+
+ * ``key_hash`` - A hash of the index key. This should work out the same, no
+ matter the cpu arch and endianness.
+
+ * ``key_len`` - The length of the index key.
+
+ * ``aux_len`` - The length of the coherency data buffer.
+
+Each cookie has an index key, which may be stored inline to the cookie or
+elsewhere. A pointer to this can be obtained by calling::
+
+ void *fscache_get_key(struct fscache_cookie *cookie);
+
+The index key is a binary blob, the storage for which is padded out to a
+multiple of 4 bytes.
+
+Each cookie also has a buffer for coherency data. This may also be inline or
+detached from the cookie and a pointer is obtained by calling::
+
+ void *fscache_get_aux(struct fscache_cookie *cookie);
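+
+For example, a backend validating an object it found in the cache might
+compare its stored auxiliary data with the cookie's current coherency
+buffer (``disk_aux`` and ``disk_aux_len`` stand in for whatever the
+backend persisted)::
+
+ void *aux = fscache_get_aux(cookie);
+
+ if (disk_aux_len != cookie->aux_len ||
+     memcmp(disk_aux, aux, cookie->aux_len) != 0)
+         return false; /* Stale: fail the lookup. */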
+
+
+
+Cookie Accounting
+=================
+
+Data storage cookies are counted and this is used to block cache withdrawal
+completion until all objects have been destroyed. The following functions are
+provided to the cache to deal with that::
+
+ void fscache_count_object(struct fscache_cache *cache);
+ void fscache_uncount_object(struct fscache_cache *cache);
+ void fscache_wait_for_objects(struct fscache_cache *cache);
+
+The count function records the allocation of an object in a cache and the
+uncount function records its destruction. Warning: by the time the uncount
+function returns, the cache may have been destroyed.
+
+The wait function can be used during the withdrawal procedure to wait for
+fscache to finish withdrawing all the objects in the cache. When it completes,
+there will be no remaining objects referring to the cache object or any volume
+objects.
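+
+A backend would typically pair these calls around object allocation and
+destruction, for example (``struct mycache_object`` is hypothetical)::
+
+ struct mycache_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+
+ if (!obj)
+         return false;
+ fscache_count_object(cache);
+ ...
+ /* On destruction, uncount last and don't touch the cache
+  * afterwards - it may already have been destroyed. */
+ kfree(obj);
+ fscache_uncount_object(cache);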
+
+
+Cache Management API
+====================
+
+The cache backend implements the cache management API by providing a table of
+operations that fscache can use to manage various aspects of the cache. These
+are held in a structure of type::
+
+ struct fscache_cache_ops {
+ const char *name;
+ ...
+ };
+
+This contains a printable name for the cache backend driver plus a number of
+pointers to methods to allow fscache to request management of the cache (a
+sketch of a filled-in table follows this list):
+
+ * Set up a volume cookie [optional]::
+
+ void (*acquire_volume)(struct fscache_volume *volume);
+
+ This method is called when a volume cookie is being created. The caller
+ holds a cache-level access pin to prevent the cache from going away for
+ the duration. This method should set up the resources to access a volume
+ in the cache and should not return until it has done so.
+
+ If successful, it can set ``cache_priv`` to its own data.
+
+
+ * Clean up volume cookie [optional]::
+
+ void (*free_volume)(struct fscache_volume *volume);
+
+ This method is called when a volume cookie is being released if
+ ``cache_priv`` is set.
+
+
+ * Look up a cookie in the cache [mandatory]::
+
+ bool (*lookup_cookie)(struct fscache_cookie *cookie);
+
+ This method is called to look up/create the resources needed to access the
+ data storage for a cookie. It is called from a worker thread with a
+ volume-level access pin in the cache to prevent it from being withdrawn.
+
+ True should be returned if successful and false otherwise. If false is
+ returned, the withdraw_cookie op (see below) will be called.
+
+ If lookup fails, but the object could still be created (e.g. it hasn't
+ been cached before), then::
+
+ void fscache_cookie_lookup_negative(
+ struct fscache_cookie *cookie);
+
+ can be called to let the network filesystem proceed and start downloading
+ stuff whilst the cache backend gets on with the job of creating things.
+
+ If successful, ``cookie->cache_priv`` can be set.
+
+
+ * Withdraw an object without any cookie access counts held [mandatory]::
+
+ void (*withdraw_cookie)(struct fscache_cookie *cookie);
+
+ This method is called to withdraw a cookie from service. It will be
+ called when the cookie is relinquished by the netfs, withdrawn or culled
+ by the cache backend or closed after a period of non-use by fscache.
+
+ The caller doesn't hold any access pins, but it is called from a
+ non-reentrant work item to manage races between the various ways
+ withdrawal can occur.
+
+ The cookie will have the ``FSCACHE_COOKIE_RETIRED`` flag set on it if the
+ associated data is to be removed from the cache.
+
+
+ * Change the size of a data storage object [mandatory]::
+
+ void (*resize_cookie)(struct netfs_cache_resources *cres,
+ loff_t new_size);
+
+ This method is called to inform the cache backend of a change in size of
+ the netfs file due to local truncation. The cache backend should make all
+ of the changes it needs to make before returning as this is done under the
+ netfs inode mutex.
+
+ The caller holds a cookie-level access pin to prevent a race with
+ withdrawal and the netfs must have the cookie marked in-use to prevent
+ garbage collection or culling from removing any resources.
+
+
+ * Invalidate a data storage object [mandatory]::
+
+ bool (*invalidate_cookie)(struct fscache_cookie *cookie);
+
+ This is called when the network filesystem detects a third-party
+ modification or when an O_DIRECT write is made locally. This requests
+ that the cache backend should throw away all the data in the cache for
+ this object and start afresh. It should return true if successful and
+ false otherwise.
+
+ On entry, new I/O operations are blocked. Once the cache is in a position
+ to accept I/O again, the backend should release the block by calling::
+
+ void fscache_resume_after_invalidation(struct fscache_cookie *cookie);
+
+ If the method returns false, caching will be withdrawn for this cookie.
+
+
+ * Prepare to make local modifications to the cache [mandatory]::
+
+ void (*prepare_to_write)(struct fscache_cookie *cookie);
+
+ This method is called when the network filesystem finds that it is going
+ to need to modify the contents of the cache due to local writes or
+ truncations. This gives the cache a chance to note that a cache object
+ may be incoherent with respect to the server and may need writing back
+ later. This may also cause the cached data to be scrapped on later
+ rebinding if not properly committed.
+
+
+ * Begin an operation for the netfs lib [mandatory]::
+
+ bool (*begin_operation)(struct netfs_cache_resources *cres,
+ enum fscache_want_state want_state);
+
+ This method is called when an I/O operation is being set up (read, write
+ or resize). The caller holds an access pin on the cookie and must have
+ marked the cookie as in-use.
+
+ If it can, the backend should attach any resources it needs to keep around
+ to the netfs_cache_resources object and return true.
+
+ If it can't complete the setup, it should return false.
+
+ The want_state parameter indicates the state the caller needs the cache
+ object to be in and what it wants to do during the operation:
+
+ * ``FSCACHE_WANT_PARAMS`` - The caller just wants to access cache
+ object parameters; it doesn't need to do data I/O yet.
+
+ * ``FSCACHE_WANT_READ`` - The caller wants to read data.
+
+ * ``FSCACHE_WANT_WRITE`` - The caller wants to write to or resize the
+ cache object.
+
+ Note that there won't necessarily be anything attached to the cookie's
+ cache_priv yet if the cookie is still being created.
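+
+Tying the above together, a backend's operations table might be filled in
+as follows (a sketch; the mycache_* implementations are hypothetical)::
+
+ static const struct fscache_cache_ops mycache_ops = {
+         .name                   = "mycache",
+         .acquire_volume         = mycache_acquire_volume,
+         .free_volume            = mycache_free_volume,
+         .lookup_cookie          = mycache_lookup_cookie,
+         .withdraw_cookie        = mycache_withdraw_cookie,
+         .resize_cookie          = mycache_resize_cookie,
+         .invalidate_cookie      = mycache_invalidate_cookie,
+         .prepare_to_write       = mycache_prepare_to_write,
+         .begin_operation        = mycache_begin_operation,
+ };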
+
+
+Data I/O API
+============
+
+A cache backend provides a data I/O API through the netfs library's ``struct
+netfs_cache_ops`` attached to a ``struct netfs_cache_resources`` by the
+``begin_operation`` method described above.
+
+See Documentation/filesystems/netfs_library.rst for a description.
+
+
+Miscellaneous Functions
+=======================
+
+FS-Cache provides some utilities that a cache backend may make use of:
+
+ * Note occurrence of an I/O error in a cache::
+
+ void fscache_io_error(struct fscache_cache *cache);
+
+ This tells FS-Cache that an I/O error occurred in the cache. This
+ prevents any new I/O from being started on the cache.
+
+ This does not actually withdraw the cache. That must be done separately.
+
+ * Note cessation of caching on a cookie due to failure::
+
+ void fscache_caching_failed(struct fscache_cookie *cookie);
+
+ This notes that the caching being done on a cookie failed in some
+ way, for instance the backing storage failed to be created or
+ invalidation failed, and that no further I/O operations should take
+ place on it until the cache is reset.
+
+ * Count I/O requests::
+
+ void fscache_count_read(void);
+ void fscache_count_write(void);
+
+ These record reads and writes from/to the cache. The numbers are
+ displayed in /proc/fs/fscache/stats.
+
+ * Count out-of-space errors::
+
+ void fscache_count_no_write_space(void);
+ void fscache_count_no_create_space(void);
+
+ These record ENOSPC errors in the cache, divided into failures of data
+ writes and failures of filesystem object creations (e.g. mkdir).
+
+ * Count objects culled::
+
+ void fscache_count_culled(void);
+
+ This records the culling of an object.
+
+ * Get the cookie from a set of cache resources::
+
+ struct fscache_cookie *fscache_cres_cookie(struct netfs_cache_resources *cres)
+
+ Pull a pointer to the cookie from the cache resources. This may return a
+ NULL cookie if no cookie was set.
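+
+As an example, a cache's write path might account its operations and
+errors like so (``mycache_write()`` is a hypothetical backend helper)::
+
+ fscache_count_write();
+ ret = mycache_write(cres, ...);
+ if (ret == -ENOSPC)
+         fscache_count_no_write_space();
+ else if (ret < 0)
+         fscache_io_error(fscache_cres_cookie(cres)->volume->cache);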
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/fscache-cache.h
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
deleted file mode 100644
index c418280c915f..000000000000
--- a/Documentation/filesystems/caching/backend-api.txt
+++ /dev/null
@@ -1,726 +0,0 @@
- ==========================
- FS-CACHE CACHE BACKEND API
- ==========================
-
-The FS-Cache system provides an API by which actual caches can be supplied to
-FS-Cache for it to then serve out to network filesystems and other interested
-parties.
-
-This API is declared in <linux/fscache-cache.h>.
-
-
-====================================
-INITIALISING AND REGISTERING A CACHE
-====================================
-
-To start off, a cache definition must be initialised and registered for each
-cache the backend wants to make available. For instance, CacheFS does this in
-the fill_super() operation on mounting.
-
-The cache definition (struct fscache_cache) should be initialised by calling:
-
- void fscache_init_cache(struct fscache_cache *cache,
- struct fscache_cache_ops *ops,
- const char *idfmt,
- ...);
-
-Where:
-
- (*) "cache" is a pointer to the cache definition;
-
- (*) "ops" is a pointer to the table of operations that the backend supports on
- this cache; and
-
- (*) "idfmt" is a format and printf-style arguments for constructing a label
- for the cache.
-
-
-The cache should then be registered with FS-Cache by passing a pointer to the
-previously initialised cache definition to:
-
- int fscache_add_cache(struct fscache_cache *cache,
- struct fscache_object *fsdef,
- const char *tagname);
-
-Two extra arguments should also be supplied:
-
- (*) "fsdef" which should point to the object representation for the FS-Cache
- master index in this cache. Netfs primary index entries will be created
- here. FS-Cache keeps the caller's reference to the index object if
- successful and will release it upon withdrawal of the cache.
-
- (*) "tagname" which, if given, should be a text string naming this cache. If
- this is NULL, the identifier will be used instead. For CacheFS, the
- identifier is set to name the underlying block device and the tag can be
- supplied by mount.
-
-This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
-is already in use. 0 will be returned on success.
-
-
-=====================
-UNREGISTERING A CACHE
-=====================
-
-A cache can be withdrawn from the system by calling this function with a
-pointer to the cache definition:
-
- void fscache_withdraw_cache(struct fscache_cache *cache);
-
-In CacheFS's case, this is called by put_super().
-
-
-========
-SECURITY
-========
-
-The cache methods are executed one of two contexts:
-
- (1) that of the userspace process that issued the netfs operation that caused
- the cache method to be invoked, or
-
- (2) that of one of the processes in the FS-Cache thread pool.
-
-In either case, this may not be an appropriate context in which to access the
-cache.
-
-The calling process's fsuid, fsgid and SELinux security identities may need to
-be masqueraded for the duration of the cache driver's access to the cache.
-This is left to the cache to handle; FS-Cache makes no effort in this regard.
-
-
-===================================
-CONTROL AND STATISTICS PRESENTATION
-===================================
-
-The cache may present data to the outside world through FS-Cache's interfaces
-in sysfs and procfs - the former for control and the latter for statistics.
-
-A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
-is enabled. This is accessible through the kobject struct fscache_cache::kobj
-and is for use by the cache as it sees fit.
-
-
-========================
-RELEVANT DATA STRUCTURES
-========================
-
- (*) Index/Data file FS-Cache representation cookie:
-
- struct fscache_cookie {
- struct fscache_object_def *def;
- struct fscache_netfs *netfs;
- void *netfs_data;
- ...
- };
-
- The fields that might be of use to the backend describe the object
- definition, the netfs definition and the netfs's data for this cookie.
- The object definition contain functions supplied by the netfs for loading
- and matching index entries; these are required to provide some of the
- cache operations.
-
-
- (*) In-cache object representation:
-
- struct fscache_object {
- int debug_id;
- enum {
- FSCACHE_OBJECT_RECYCLING,
- ...
- } state;
- spinlock_t lock
- struct fscache_cache *cache;
- struct fscache_cookie *cookie;
- ...
- };
-
- Structures of this type should be allocated by the cache backend and
- passed to FS-Cache when requested by the appropriate cache operation. In
- the case of CacheFS, they're embedded in CacheFS's internal object
- structures.
-
- The debug_id is a simple integer that can be used in debugging messages
- that refer to a particular object. In such a case it should be printed
- using "OBJ%x" to be consistent with FS-Cache.
-
- Each object contains a pointer to the cookie that represents the object it
- is backing. An object should retired when put_object() is called if it is
- in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
- initialised by calling fscache_object_init(object).
-
-
- (*) FS-Cache operation record:
-
- struct fscache_operation {
- atomic_t usage;
- struct fscache_object *object;
- unsigned long flags;
- #define FSCACHE_OP_EXCLUSIVE
- void (*processor)(struct fscache_operation *op);
- void (*release)(struct fscache_operation *op);
- ...
- };
-
- FS-Cache has a pool of threads that it uses to give CPU time to the
- various asynchronous operations that need to be done as part of driving
- the cache. These are represented by the above structure. The processor
- method is called to give the op CPU time, and the release method to get
- rid of it when its usage count reaches 0.
-
- An operation can be made exclusive upon an object by setting the
- appropriate flag before enqueuing it with fscache_enqueue_operation(). If
- an operation needs more processing time, it should be enqueued again.
-
-
- (*) FS-Cache retrieval operation record:
-
- struct fscache_retrieval {
- struct fscache_operation op;
- struct address_space *mapping;
- struct list_head *to_do;
- ...
- };
-
- A structure of this type is allocated by FS-Cache to record retrieval and
- allocation requests made by the netfs. This struct is then passed to the
- backend to do the operation. The backend may get extra refs to it by
- calling fscache_get_retrieval() and refs may be discarded by calling
- fscache_put_retrieval().
-
- A retrieval operation can be used by the backend to do retrieval work. To
- do this, the retrieval->op.processor method pointer should be set
- appropriately by the backend and fscache_enqueue_retrieval() called to
- submit it to the thread pool. CacheFiles, for example, uses this to queue
- page examination when it detects PG_lock being cleared.
-
- The to_do field is an empty list available for the cache backend to use as
- it sees fit.
-
-
- (*) FS-Cache storage operation record:
-
- struct fscache_storage {
- struct fscache_operation op;
- pgoff_t store_limit;
- ...
- };
-
- A structure of this type is allocated by FS-Cache to record outstanding
- writes to be made. FS-Cache itself enqueues this operation and invokes
- the write_page() method on the object at appropriate times to effect
- storage.
-
-
-================
-CACHE OPERATIONS
-================
-
-The cache backend provides FS-Cache with a table of operations that can be
-performed on the denizens of the cache. These are held in a structure of type:
-
- struct fscache_cache_ops
-
- (*) Name of cache provider [mandatory]:
-
- const char *name
-
- This isn't strictly an operation, but should be pointed at a string naming
- the backend.
-
-
- (*) Allocate a new object [mandatory]:
-
- struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
- struct fscache_cookie *cookie)
-
- This method is used to allocate a cache object representation to back a
- cookie in a particular cache. fscache_object_init() should be called on
- the object to initialise it prior to returning.
-
- This function may also be used to parse the index key to be used for
- multiple lookup calls to turn it into a more convenient form. FS-Cache
- will call the lookup_complete() method to allow the cache to release the
- form once lookup is complete or aborted.
-
-
- (*) Look up and create object [mandatory]:
-
- void (*lookup_object)(struct fscache_object *object)
-
- This method is used to look up an object, given that the object is already
- allocated and attached to the cookie. This should instantiate that object
- in the cache if it can.
-
- The method should call fscache_object_lookup_negative() as soon as
- possible if it determines the object doesn't exist in the cache. If the
- object is found to exist and the netfs indicates that it is valid then
- fscache_obtained_object() should be called once the object is in a
- position to have data stored in it. Similarly, fscache_obtained_object()
- should also be called once a non-present object has been created.
-
- If a lookup error occurs, fscache_object_lookup_error() should be called
- to abort the lookup of that object.
-
-
- (*) Release lookup data [mandatory]:
-
- void (*lookup_complete)(struct fscache_object *object)
-
- This method is called to ask the cache to release any resources it was
- using to perform a lookup.
-
-
- (*) Increment object refcount [mandatory]:
-
- struct fscache_object *(*grab_object)(struct fscache_object *object)
-
- This method is called to increment the reference count on an object. It
- may fail (for instance if the cache is being withdrawn) by returning NULL.
- It should return the object pointer if successful.
-
-
- (*) Lock/Unlock object [mandatory]:
-
- void (*lock_object)(struct fscache_object *object)
- void (*unlock_object)(struct fscache_object *object)
-
- These methods are used to exclusively lock an object. It must be possible
- to schedule with the lock held, so a spinlock isn't sufficient.
-
-
- (*) Pin/Unpin object [optional]:
-
- int (*pin_object)(struct fscache_object *object)
- void (*unpin_object)(struct fscache_object *object)
-
- These methods are used to pin an object into the cache. Once pinned an
- object cannot be reclaimed to make space. Return -ENOSPC if there's not
- enough space in the cache to permit this.
-
-
- (*) Check coherency state of an object [mandatory]:
-
- int (*check_consistency)(struct fscache_object *object)
-
- This method is called to have the cache check the saved auxiliary data of
- the object against the netfs's idea of the state. 0 should be returned
- if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
- may also be returned.
-
- (*) Update object [mandatory]:
-
- int (*update_object)(struct fscache_object *object)
-
- This is called to update the index entry for the specified object. The
- new information should be in object->cookie->netfs_data. This can be
- obtained by calling object->cookie->def->get_aux()/get_attr().
-
-
- (*) Invalidate data object [mandatory]:
-
- int (*invalidate_object)(struct fscache_operation *op)
-
- This is called to invalidate a data object (as pointed to by op->object).
- All the data stored for this object should be discarded and an
- attr_changed operation should be performed. The caller will follow up
- with an object update operation.
-
- fscache_op_complete() must be called on op before returning.
-
-
- (*) Discard object [mandatory]:
-
- void (*drop_object)(struct fscache_object *object)
-
- This method is called to indicate that an object has been unbound from its
- cookie, and that the cache should release the object's resources and
- retire it if it's in state FSCACHE_OBJECT_RECYCLING.
-
- This method should not attempt to release any references held by the
- caller. The caller will invoke the put_object() method as appropriate.
-
-
- (*) Release object reference [mandatory]:
-
- void (*put_object)(struct fscache_object *object)
-
- This method is used to discard a reference to an object. The object may
- be freed when all the references to it are released.
-
-
- (*) Synchronise a cache [mandatory]:
-
- void (*sync)(struct fscache_cache *cache)
-
- This is called to ask the backend to synchronise a cache with its backing
- device.
-
-
- (*) Dissociate a cache [mandatory]:
-
- void (*dissociate_pages)(struct fscache_cache *cache)
-
- This is called to ask a cache to perform any page dissociations as part of
- cache withdrawal.
-
-
- (*) Notification that the attributes on a netfs file changed [mandatory]:
-
- int (*attr_changed)(struct fscache_object *object);
-
- This is called to indicate to the cache that certain attributes on a netfs
- file have changed (for example the maximum size a file may reach). The
- cache can read these from the netfs by calling the cookie's get_attr()
- method.
-
- The cache may use the file size information to reserve space on the cache.
- It should also call fscache_set_store_limit() to indicate to FS-Cache the
- highest byte it's willing to store for an object.
-
- This method may return -ve if an error occurred or the cache object cannot
- be expanded. In such a case, the object will be withdrawn from service.
-
- This operation is run asynchronously from FS-Cache's thread pool, and
- storage and retrieval operations from the netfs are excluded during the
- execution of this operation.
-
-
- (*) Reserve cache space for an object's data [optional]:
-
- int (*reserve_space)(struct fscache_object *object, loff_t size);
-
- This is called to request that cache space be reserved to hold the data
- for an object and the metadata used to track it. Zero size should be
- taken as request to cancel a reservation.
-
- This should return 0 if successful, -ENOSPC if there isn't enough space
- available, or -ENOMEM or -EIO on other errors.
-
- The reservation may exceed the current size of the object, thus permitting
- future expansion. If the amount of space consumed by an object would
- exceed the reservation, it's permitted to refuse requests to allocate
- pages, but not required. An object may be pruned down to its reservation
- size if larger than that already.
-
-
- (*) Request page be read from cache [mandatory]:
-
- int (*read_or_alloc_page)(struct fscache_retrieval *op,
- struct page *page,
- gfp_t gfp)
-
- This is called to attempt to read a netfs page from the cache, or to
- reserve a backing block if not. FS-Cache will have done as much checking
- as it can before calling, but most of the work belongs to the backend.
-
- If there's no page in the cache, then -ENODATA should be returned if the
- backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
- didn't.
-
- If there is suitable data in the cache, then a read operation should be
- queued and 0 returned. When the read finishes, fscache_end_io() should be
- called.
-
- The fscache_mark_pages_cached() should be called for the page if any cache
- metadata is retained. This will indicate to the netfs that the page needs
- explicit uncaching. This operation takes a pagevec, thus allowing several
- pages to be marked at once.
-
- The retrieval record pointed to by op should be retained for each page
- queued and released when I/O on the page has been formally ended.
- fscache_get/put_retrieval() are available for this purpose.
-
- The retrieval record may be used to get CPU time via the FS-Cache thread
- pool. If this is desired, the op->op.processor should be set to point to
- the appropriate processing routine, and fscache_enqueue_retrieval() should
- be called at an appropriate point to request CPU time. For instance, the
- retrieval routine could be enqueued upon the completion of a disk read.
- The to_do field in the retrieval record is provided to aid in this.
-
- If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
- returned if possible or fscache_end_io() called with a suitable error
- code.
-
- fscache_put_retrieval() should be called after a page or pages are dealt
- with. This will complete the operation when all pages are dealt with.
-
-
- (*) Request pages be read from cache [mandatory]:
-
- int (*read_or_alloc_pages)(struct fscache_retrieval *op,
- struct list_head *pages,
- unsigned *nr_pages,
- gfp_t gfp)
-
- This is like the read_or_alloc_page() method, except it is handed a list
- of pages instead of one page. Any pages on which a read operation is
- started must be added to the page cache for the specified mapping and also
- to the LRU. Such pages must also be removed from the pages list and
- *nr_pages decremented per page.
-
- If there was an error such as -ENOMEM, then that should be returned; else
- if one or more pages couldn't be read or allocated, then -ENOBUFS should
- be returned; else if one or more pages couldn't be read, then -ENODATA
- should be returned. If all the pages are dispatched then 0 should be
- returned.
-
-
- (*) Request page be allocated in the cache [mandatory]:
-
- int (*allocate_page)(struct fscache_retrieval *op,
- struct page *page,
- gfp_t gfp)
-
- This is like the read_or_alloc_page() method, except that it shouldn't
- read from the cache, even if there's data there that could be retrieved.
- It should, however, set up any internal metadata required such that
- the write_page() method can write to the cache.
-
- If there's no backing block available, then -ENOBUFS should be returned
- (or -ENOMEM if there were other problems). If a block is successfully
- allocated, then the netfs page should be marked and 0 returned.
-
-
- (*) Request pages be allocated in the cache [mandatory]:
-
- int (*allocate_pages)(struct fscache_retrieval *op,
- struct list_head *pages,
- unsigned *nr_pages,
- gfp_t gfp)
-
- This is a multiple page version of the allocate_page() method. pages and
- nr_pages should be treated as for the read_or_alloc_pages() method.
-
-
- (*) Request page be written to cache [mandatory]:
-
- int (*write_page)(struct fscache_storage *op,
- struct page *page);
-
- This is called to write from a page on which there was a previously
- successful read_or_alloc_page() call or similar. FS-Cache filters out
- pages that don't have mappings.
-
- This method is called asynchronously from the FS-Cache thread pool. It is
- not required to actually store anything, provided -ENODATA is then
- returned to the next read of this page.
-
- If an error occurred, then a negative error code should be returned,
- otherwise zero should be returned. FS-Cache will take appropriate action
- in response to an error, such as withdrawing this object.
-
- If this method returns success then FS-Cache will inform the netfs
- appropriately.
-
-
- (*) Discard retained per-page metadata [mandatory]:
-
- void (*uncache_page)(struct fscache_object *object, struct page *page)
-
- This is called when a netfs page is being evicted from the pagecache. The
- cache backend should tear down any internal representation or tracking it
- maintains for this page.
-
-
-==================
-FS-CACHE UTILITIES
-==================
-
-FS-Cache provides some utilities that a cache backend may make use of:
-
- (*) Note occurrence of an I/O error in a cache:
-
- void fscache_io_error(struct fscache_cache *cache)
-
- This tells FS-Cache that an I/O error occurred in the cache. After this
- has been called, only resource dissociation operations (object and page
- release) will be passed from the netfs to the cache backend for the
- specified cache.
-
- This does not actually withdraw the cache. That must be done separately.
-
-
- (*) Invoke the retrieval I/O completion function:
-
- void fscache_end_io(struct fscache_retrieval *op, struct page *page,
- int error);
-
- This is called to note the end of an attempt to retrieve a page. The
- error value should be 0 if successful and an error otherwise.
-
-
- (*) Record that one or more pages being retrieved or allocated have been dealt
- with:
-
- void fscache_retrieval_complete(struct fscache_retrieval *op,
- int n_pages);
-
- This is called to record the fact that one or more pages have been dealt
- with and are no longer the concern of this operation. When the number of
- pages remaining in the operation reaches 0, the operation will be
- completed.
-
-
- (*) Record operation completion:
-
- void fscache_op_complete(struct fscache_operation *op);
-
- This is called to record the completion of an operation. This deducts
- this operation from the parent object's run state, potentially permitting
- one or more pending operations to start running.
-
-
- (*) Set highest store limit:
-
- void fscache_set_store_limit(struct fscache_object *object,
- loff_t i_size);
-
- This sets the limit FS-Cache imposes on the highest byte it's willing to
- try and store for a netfs. Any page over this limit is automatically
- rejected by fscache_read_alloc_page() and co with -ENOBUFS.
-
-
- (*) Mark pages as being cached:
-
- void fscache_mark_pages_cached(struct fscache_retrieval *op,
- struct pagevec *pagevec);
-
- This marks a set of pages as being cached. After this has been called,
- the netfs must call fscache_uncache_page() to unmark the pages.
-
-
- (*) Perform coherency check on an object:
-
- enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
- const void *data,
- uint16_t datalen);
-
- This asks the netfs to perform a coherency check on an object that has
- just been looked up. The cookie attached to the object will determine the
- netfs to use. data and datalen should specify where the auxiliary data
- retrieved from the cache can be found.
-
- One of three values will be returned:
-
- (*) FSCACHE_CHECKAUX_OKAY
-
- The coherency data indicates the object is valid as is.
-
- (*) FSCACHE_CHECKAUX_NEEDS_UPDATE
-
- The coherency data needs updating, but otherwise the object is
- valid.
-
- (*) FSCACHE_CHECKAUX_OBSOLETE
-
- The coherency data indicates that the object is obsolete and should
- be discarded.
-
-
- (*) Initialise a freshly allocated object:
-
- void fscache_object_init(struct fscache_object *object);
-
- This initialises all the fields in an object representation.
-
-
- (*) Indicate the destruction of an object:
-
- void fscache_object_destroyed(struct fscache_cache *cache);
-
- This must be called to inform FS-Cache that an object that belonged to a
- cache has been destroyed and deallocated. This will allow continuation
- of the cache withdrawal process when it is stopped pending destruction of
- all the objects.
-
-
- (*) Indicate negative lookup on an object:
-
- void fscache_object_lookup_negative(struct fscache_object *object);
-
- This is called to indicate to FS-Cache that a lookup process for an object
- found a negative result.
-
- This changes the state of an object to permit reads pending on lookup
- completion to go off and start fetching data from the netfs server as it's
- known at this point that there can't be any data in the cache.
-
- This may be called multiple times on an object. Only the first call is
- significant - all subsequent calls are ignored.
-
-
- (*) Indicate an object has been obtained:
-
- void fscache_obtained_object(struct fscache_object *object);
-
- This is called to indicate to FS-Cache that a lookup process for an object
- produced a positive result, or that an object was created. This should
- only be called once for any particular object.
-
- This changes the state of an object to indicate:
-
- (1) if no call to fscache_object_lookup_negative() has been made on
- this object, that there may be data available, and that reads can
- now go and look for it; and
-
- (2) that writes may now proceed against this object.
-
-
- (*) Indicate that object lookup failed:
-
- void fscache_object_lookup_error(struct fscache_object *object);
-
- This marks an object as having encountered a fatal error (usually EIO)
- and causes it to move into a state whereby it will be withdrawn as soon
- as possible.
-
-
- (*) Indicate that a stale object was found and discarded:
-
- void fscache_object_retrying_stale(struct fscache_object *object);
-
- This is called to indicate that the lookup procedure found an object in
- the cache that the netfs decided was stale. The object has been
- discarded from the cache and the lookup will be performed again.
-
-
- (*) Indicate that the caching backend killed an object:
-
- void fscache_object_mark_killed(struct fscache_object *object,
- enum fscache_why_object_killed why);
-
- This is called to indicate that the cache backend preemptively killed an
- object. The why parameter should be set to indicate the reason:
-
- FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding.
- FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space
- FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished.
- FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space.
-
-
- (*) Get and release references on a retrieval record:
-
- void fscache_get_retrieval(struct fscache_retrieval *op);
- void fscache_put_retrieval(struct fscache_retrieval *op);
-
- These two functions are used to retain a retrieval record while doing
- asynchronous data retrieval and block allocation.
-
-
- (*) Enqueue a retrieval record for processing.
-
- void fscache_enqueue_retrieval(struct fscache_retrieval *op);
-
- This enqueues a retrieval record for processing by the FS-Cache thread
- pool. One of the threads in the pool will invoke the retrieval record's
- op->op.processor callback function. This function may be called from
- within the callback function.
-
-
- (*) List of object state names:
-
- const char *fscache_object_states[];
-
- For debugging purposes, this may be used to turn the state that an object
- is in into a text string for display purposes.
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.rst
index 28aefcbb1442..e04a27bdbe19 100644
--- a/Documentation/filesystems/caching/cachefiles.txt
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -1,8 +1,10 @@
- ===============================================
- CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
- ===============================================
+.. SPDX-License-Identifier: GPL-2.0
-Contents:
+===================================
+Cache on Already Mounted Filesystem
+===================================
+
+.. Contents:
(*) Overview.
@@ -26,9 +28,10 @@ Contents:
(*) Debugging.
+ (*) On-demand Read.
-========
-OVERVIEW
+
+Overview
========
CacheFiles is a caching backend that's meant to use as a cache a directory on
@@ -58,8 +61,8 @@ spare space and automatically contract when the set of data requires more
space.
-============
-REQUIREMENTS
+
+Requirements
============
The use of CacheFiles and its daemon requires the following features to be
@@ -79,84 +82,70 @@ It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.
-=============
-CONFIGURATION
+Configuration
=============
The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up cache ready for use. The following script commands are available:
- (*) brun <N>%
- (*) bcull <N>%
- (*) bstop <N>%
- (*) frun <N>%
- (*) fcull <N>%
- (*) fstop <N>%
-
+ brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>%
Configure the culling limits. Optional. See the section on culling.
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.
- (*) dir <path>
-
+ dir <path>
Specify the directory containing the root of the cache. Mandatory.
- (*) tag <name>
-
+ tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".
- (*) debug <mask>
-
+ debug <mask>
Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:
+ == =================================================
1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())
+ == =================================================
- This mask can also be set through sysfs, eg:
+ This mask can also be set through sysfs, eg::
echo 5 >/sys/modules/cachefiles/parameters/debug
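+
+As an illustration, a minimal /etc/cachefilesd.conf combining these commands
+might look like the following sketch; the directory, tag and limits shown are
+examples only::
+
+    dir /var/cache/fscache
+    tag mycache
+    brun 10%
+    bcull 7%
+    bstop 3%
+    frun 10%
+    fcull 7%
+    fstop 3%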
-==================
-STARTING THE CACHE
+Starting the Cache
==================
The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.
-The daemon is run as follows:
+The daemon is run as follows::
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
The flags are:
- (*) -d
-
+ ``-d``
Increase the debugging level. This can be specified multiple times and
is cumulative with itself.
- (*) -s
-
+ ``-s``
Send messages to stderr instead of syslog.
- (*) -n
-
+ ``-n``
Don't daemonise and go into background.
- (*) -f <configfile>
-
+ ``-f <configfile>``
Use an alternative configuration file rather than the default one.
-===============
-THINGS TO AVOID
+Things to Avoid
===============
Do not mount other things within the cache as this will cause problems. The
@@ -179,8 +168,7 @@ Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly.
-=============
-CACHE CULLING
+Cache Culling
=============
The cache may need culling occasionally to make space. This involves
@@ -192,27 +180,21 @@ Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six
"limits":
- (*) brun
- (*) frun
-
+ brun, frun
If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off.
- (*) bcull
- (*) fcull
-
+ bcull, fcull
If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.
- (*) bstop
- (*) fstop
-
+ bstop, fstop
If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.
-These must be configured thusly:
+These must be configured thusly::
0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100
@@ -226,16 +208,14 @@ started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.
-===============
-CACHE STRUCTURE
+Cache Structure
===============
The CacheFiles module will create two directories in the directory it was
given:
- (*) cache/
-
- (*) graveyard/
+ * cache/
+ * graveyard/
The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
@@ -261,10 +241,10 @@ If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
-objects:
+objects::
- INDEX INDEX INDEX DATA FILES
- ========= ========== ================================= ================
+ /INDEX /INDEX /INDEX /DATA FILES
+ /=========/==========/=================================/================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
@@ -275,7 +255,7 @@ If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
-'+' prepended:
+'+' prepended::
J1223/@23/+xy...z/+kl...m/Epqr
@@ -288,11 +268,13 @@ To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:
+ =============== =============== ===============
OBJECT TYPE PRINTABLE ENCODED
=============== =============== ===============
Index "I..." "J..."
Data "D..." "E..."
Special "S..." "T..."
+ =============== =============== ===============
Intermediate directories are always "@" or "+" as appropriate.
@@ -307,8 +289,7 @@ Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).
-==========================
-SECURITY MODEL AND SELINUX
+Security Model and SELinux
==========================
CacheFiles is implemented to deal properly with the LSM security features of
@@ -331,26 +312,26 @@ When the CacheFiles module is asked to bind to its cache, it:
(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
- this is:
+ this is::
cachefiles_var_t
(2) Finds the security label of the process which issued the bind request
- (presumed to be the cachefilesd daemon), which by default will be:
+ (presumed to be the cachefilesd daemon), which by default will be::
cachefilesd_t
and asks LSM to supply a security ID as which it should act given the
- daemon's label. By default, this will be:
+ daemon's label. By default, this will be::
cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID
- based on a rule of this form in the policy.
+ based on a rule of this form in the policy::
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
- For instance:
+ For instance::
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
@@ -368,9 +349,9 @@ data cached therein; nor is it permitted to create new files in the cache.
There are policy source files available in:
- http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+ https://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
-and later versions. In that tarball, see the files:
+and later versions. In that tarball, see the files::
cachefilesd.te
cachefilesd.fc
@@ -379,7 +360,7 @@ and later versions. In that tarball, see the files:
They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own
-directory and run:
+directory and run::
make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp
@@ -394,7 +375,7 @@ an auxiliary policy must be installed to label the alternate location of the
cache.
For instructions on how to add an auxiliary policy to enable the cache to be
-located elsewhere when SELinux is in enforcing mode, please see:
+located elsewhere when SELinux is in enforcing mode, please see::
/usr/share/doc/cachefilesd-*/move-cache.txt
@@ -402,8 +383,7 @@ When the cachefilesd rpm is installed; alternatively, the document can be found
in the sources.
-==================
-A NOTE ON SECURITY
+A Note on Security
==================
CacheFiles makes use of the split security in the task_struct. It allocates
@@ -436,7 +416,7 @@ process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
-may be overridden. This is not seen externally, and is used whan a process
+may be overridden. This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.
@@ -445,17 +425,18 @@ for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.
-=======================
-STATISTICAL INFORMATION
+Statistical Information
=======================
-If FS-Cache is compiled with the following option enabled:
+If FS-Cache is compiled with the following option enabled::
CONFIG_CACHEFILES_HISTOGRAM=y
then it will gather certain statistics and display them through a proc file.
- (*) /proc/fs/cachefiles/histogram
+ /proc/fs/cachefiles/histogram
+
+ ::
cat /proc/fs/cachefiles/histogram
JIFS SECS LOOKUPS MKDIRS CREATES
@@ -465,37 +446,217 @@ then it will gather certain statistics and display them through a proc file.
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
columns are as follows:
+ ======= =======================================================
COLUMN TIME MEASUREMENT
======= =======================================================
LOOKUPS Length of time to perform a lookup on the backing fs
MKDIRS Length of time to perform a mkdir on the backing fs
CREATES Length of time to perform a create on the backing fs
+ ======= =======================================================
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
-=========
-DEBUGGING
+Debugging
=========
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
-debugging enabled by adjusting the value in:
+debugging enabled by adjusting the value in::
/sys/module/cachefiles/parameters/debug
This is a bitmask of debugging streams to enable:
+ ======= ======= =============================== =======================
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 General Function entry trace
1 2 Function exit trace
2 4 General
+ ======= ======= =============================== =======================
The appropriate set of values should be OR'd together and the result written to
-the control file. For example:
+the control file. For example::
echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
will turn on all function entry debugging.
+
+
+On-demand Read
+==============
+
+When working in its original mode, CacheFiles serves as a local cache for a
+remote networking fs, while in on-demand read mode, CacheFiles supports
+scenarios where on-demand read semantics are needed, e.g. container image
+distribution.
+
+The essential difference between these two modes is seen when a cache miss
+occurs: In the original mode, the netfs will fetch the data from the remote
+server and then write it to the cache file; in on-demand read mode, fetching
+the data and writing it into the cache is delegated to a user daemon.
+
+``CONFIG_CACHEFILES_ONDEMAND`` should be enabled to support on-demand read mode.
+
+
+Protocol Communication
+----------------------
+
+The on-demand read mode uses a simple protocol for communication between kernel
+and user daemon. The protocol can be modeled as::
+
+ kernel --[request]--> user daemon --[reply]--> kernel
+
+CacheFiles will send requests to the user daemon when needed. The user daemon
+should poll the devnode ('/dev/cachefiles') to check if there's a pending
+request to be processed. A POLLIN event will be returned when there's a pending
+request.
+
+The user daemon then reads the devnode to fetch a request to process. It should
+be noted that each read only gets one request. When it has finished processing
+the request, the user daemon should write the reply to the devnode.
+
+Each request starts with a message header of the form::
+
+ struct cachefiles_msg {
+ __u32 msg_id;
+ __u32 opcode;
+ __u32 len;
+ __u32 object_id;
+ __u8 data[];
+ };
+
+where:
+
+ * ``msg_id`` is a unique ID identifying this request among all pending
+ requests.
+
+ * ``opcode`` indicates the type of this request.
+
+ * ``object_id`` is a unique ID identifying the cache file operated on.
+
+ * ``data`` indicates the payload of this request.
+
+ * ``len`` indicates the whole length of this request, including the
+ header and following type-specific payload.
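+
+As an illustration, the core of a user daemon built on this protocol might
+look like the following sketch (userspace C, error handling trimmed). It
+assumes the message and opcode definitions are available from the UAPI header
+<linux/cachefiles.h>, and that the devnode has already been configured and
+bound::
+
+    #include <poll.h>
+    #include <unistd.h>
+    #include <linux/cachefiles.h>
+
+    static void daemon_loop(int devfd)
+    {
+        char buf[4096];
+        struct pollfd p = { .fd = devfd, .events = POLLIN };
+
+        for (;;) {
+            /* Wait for a pending request; POLLIN signals one. */
+            if (poll(&p, 1, -1) <= 0 || !(p.revents & POLLIN))
+                continue;
+
+            /* Each read fetches exactly one request. */
+            if (read(devfd, buf, sizeof(buf)) <= 0)
+                continue;
+
+            struct cachefiles_msg *msg = (struct cachefiles_msg *)buf;
+
+            switch (msg->opcode) {
+            case CACHEFILES_OP_OPEN:  /* see "The OPEN Request" */
+            case CACHEFILES_OP_CLOSE: /* see "The CLOSE Request" */
+            case CACHEFILES_OP_READ:  /* see "The READ Request" */
+                break;
+            }
+        }
+    }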
+
+
+Turning on On-demand Mode
+-------------------------
+
+An optional parameter becomes available to the "bind" command::
+
+ bind [ondemand]
+
+When the "bind" command is given no argument, it defaults to the original mode.
+When it is given the "ondemand" argument, i.e. "bind ondemand", on-demand read
+mode will be enabled.
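+
+For example, a daemon could issue its configuration and bind commands by
+writing them, one command per write, to the open devnode. This is a sketch;
+the cache directory and tag shown are illustrative::
+
+    #include <fcntl.h>
+    #include <string.h>
+    #include <unistd.h>
+
+    static void command(int devfd, const char *cmd)
+    {
+        /* Each write carries exactly one command. */
+        write(devfd, cmd, strlen(cmd));
+    }
+
+    static int bind_ondemand(void)
+    {
+        int devfd = open("/dev/cachefiles", O_RDWR);
+
+        command(devfd, "dir /var/cache/fscache");
+        command(devfd, "tag mycache");
+        command(devfd, "bind ondemand");
+        return devfd;
+    }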
+
+
+The OPEN Request
+----------------
+
+When the netfs opens a cache file for the first time, a request with the
+CACHEFILES_OP_OPEN opcode, a.k.a. an OPEN request, will be sent to the user
+daemon. The payload format is of the form::
+
+ struct cachefiles_open {
+ __u32 volume_key_size;
+ __u32 cookie_key_size;
+ __u32 fd;
+ __u32 flags;
+ __u8 data[];
+ };
+
+where:
+
+ * ``data`` contains the volume_key followed directly by the cookie_key.
+ The volume key is a NUL-terminated string; the cookie key is binary
+ data.
+
+ * ``volume_key_size`` indicates the size of the volume key in bytes.
+
+ * ``cookie_key_size`` indicates the size of the cookie key in bytes.
+
+ * ``fd`` indicates an anonymous fd referring to the cache file, through
+ which the user daemon can perform write/llseek file operations on the
+ cache file.
+
+
+The user daemon can use the given (volume_key, cookie_key) pair to distinguish
+the requested cache file. With the given anonymous fd, the user daemon can
+fetch the data and write it to the cache file in the background, even when
+the kernel has not yet triggered a cache miss.
+
+Note that each cache file has a unique object_id, while it may have multiple
+anonymous fds. The user daemon may duplicate anonymous fds from the initial
+anonymous fd indicated by the @fd field through dup(). Thus each object_id can
+be mapped to multiple anonymous fds, and the user daemon itself needs to
+maintain the mapping.
+
+When implementing a user daemon, please be careful of RLIMIT_NOFILE,
+``/proc/sys/fs/nr_open`` and ``/proc/sys/fs/file-max``. Typically these needn't
+be huge since they're related to the number of open device blobs rather than
+open files of each individual filesystem.
+
+The user daemon should reply to the OPEN request by issuing a "copen" (complete
+open) command on the devnode::
+
+ copen <msg_id>,<cache_size>
+
+where:
+
+ * ``msg_id`` must match the msg_id field of the OPEN request.
+
+ * When >= 0, ``cache_size`` indicates the size of the cache file;
+ when < 0, ``cache_size`` indicates any error code encountered by the
+ user daemon.
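+
+Putting this together, a hedged sketch of an OPEN handler follows. The
+helpers remember_fd() and lookup_size() are hypothetical, standing in for
+recording the anonymous fd against the object_id and mapping the keys to the
+object's real size; the structures come from <linux/cachefiles.h>::
+
+    #include <stdio.h>
+    #include <string.h>
+    #include <unistd.h>
+    #include <linux/cachefiles.h>
+
+    /* Hypothetical helpers supplied by the daemon. */
+    extern void remember_fd(unsigned int object_id, int fd);
+    extern long long lookup_size(const char *volume_key,
+                                 const char *cookie_key,
+                                 unsigned int cookie_key_size);
+
+    static void handle_open(int devfd, struct cachefiles_msg *msg)
+    {
+        struct cachefiles_open *load = (void *)msg->data;
+        const char *volume_key = (const char *)load->data;
+        const char *cookie_key =
+            (const char *)load->data + load->volume_key_size;
+        long long size;
+        char reply[64];
+
+        /* Keep the anonymous fd for later READ/CLOSE handling. */
+        remember_fd(msg->object_id, load->fd);
+
+        /* A negative size would report an error code instead. */
+        size = lookup_size(volume_key, cookie_key, load->cookie_key_size);
+
+        snprintf(reply, sizeof(reply), "copen %u,%lld", msg->msg_id, size);
+        write(devfd, reply, strlen(reply));
+    }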
+
+
+The CLOSE Request
+-----------------
+
+When a cookie is withdrawn, a CLOSE request (opcode CACHEFILES_OP_CLOSE) will
+be sent to the user daemon. This tells the user daemon to close all anonymous
+fds associated with the given object_id. The CLOSE request has no extra
+payload, and shouldn't be replied to.
+
+
+The READ Request
+----------------
+
+When a cache miss is encountered in on-demand read mode, CacheFiles will send a
+READ request (opcode CACHEFILES_OP_READ) to the user daemon. This tells the user
+daemon to fetch the contents of the requested file range. The payload is of the
+form::
+
+ struct cachefiles_read {
+ __u64 off;
+ __u64 len;
+ };
+
+where:
+
+ * ``off`` indicates the starting offset of the requested file range.
+
+ * ``len`` indicates the length of the requested file range.
+
+
+When it receives a READ request, the user daemon should fetch the requested data
+and write it to the cache file identified by object_id.
+
+When it has finished processing the READ request, the user daemon should reply
+by using the CACHEFILES_IOC_READ_COMPLETE ioctl on one of the anonymous fds
+associated with the object_id given in the READ request. The ioctl is of the
+form::
+
+ ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg_id);
+
+where:
+
+ * ``fd`` is one of the anonymous fds associated with the object_id
+ given.
+
+ * ``msg_id`` must match the msg_id field of the READ request.
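+
+A corresponding READ handler might look like the sketch below, where
+fd_for_object() and fetch_range() are hypothetical helpers that recover a
+recorded anonymous fd and obtain the missing data from the remote source::
+
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/cachefiles.h>
+
+    /* Hypothetical helpers supplied by the daemon. */
+    extern int fd_for_object(unsigned int object_id);
+    extern void *fetch_range(unsigned int object_id,
+                             unsigned long long off,
+                             unsigned long long len);
+
+    static void handle_read(struct cachefiles_msg *msg)
+    {
+        struct cachefiles_read *req = (void *)msg->data;
+        int fd = fd_for_object(msg->object_id);
+        void *data = fetch_range(msg->object_id, req->off, req->len);
+
+        /* Fill the requested range of the cache file... */
+        pwrite(fd, data, req->len, req->off);
+
+        /* ...then tell the kernel this request is complete. */
+        ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg->msg_id);
+    }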
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
new file mode 100644
index 000000000000..a74d7b052dc1
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -0,0 +1,348 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+General Filesystem Caching
+==========================
+
+Overview
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFiles) and network
+filesystems::
+
+ +---------+
+ | | +--------------+
+ | NFS |--+ | |
+ | | | +-->| CacheFS |
+ +---------+ | +----------+ | | /dev/hda5 |
+ | | | | +--------------+
+ +---------+ +-------------->| | |
+ | | +-------+ | |--+
+ | AFS |----->| | | FS-Cache |
+ | | | netfs |-->| |--+
+ +---------+ +-->| lib | | | |
+ | | | | | | +--------------+
+ +---------+ | +-------+ +----------+ | | |
+ | | | +-->| CacheFiles |
+ | 9P |--+ | /var/cache |
+ | | +--------------+
+ +---------+
+
+Or to look at it another way, FS-Cache is a module that provides a caching
+facility to a network filesystem such that the cache is transparent to the
+user::
+
+ +---------+
+ | |
+ | Server |
+ | |
+ +---------+
+ | NETWORK
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ |
+ | +----------+
+ V | |
+ +---------+ | |
+ | | | |
+ | NFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ | | | +--------------+ +--------------+
+ | | | | | | | |
+ V +----------+ +-->| CacheFiles |-->| Ext3 |
+ +---------+ | /var/cache | | /dev/sda6 |
+ | | +--------------+ +--------------+
+ | VFS | ^ ^
+ | | | |
+ +---------+ +--------------+ |
+ | KERNEL SPACE | |
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
+ | USER SPACE | |
+ V | |
+ +---------+ +--------------+
+ | | | |
+ | Process | | cachefilesd |
+ | | | |
+ +---------+ +--------------+
+
+
+FS-Cache does not follow the idea of completely loading every netfs file
+opened in its entirety into a cache before permitting it to be accessed and
+then serving the pages out of that cache rather than the netfs inode because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+ cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+ must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+ one-off access of a small portion of it (such as might be done with the
+ "file" program).
+
+It instead serves the cache out in chunks as and when requested by the netfs
+using it.
+
+
+FS-Cache provides the following facilities:
+
+ * More than one cache can be used at once. Caches can be selected
+ explicitly by use of tags.
+
+ * Caches can be added / removed at any time, even whilst being accessed.
+
+ * The netfs is provided with an interface that allows either party to
+ withdraw caching facilities from a file (required for (2)).
+
+ * The interface to the netfs returns as few errors as possible, preferring
+ rather to let the netfs remain oblivious.
+
+ * There are three types of cookie: cache, volume and data file cookies.
+ Cache cookies represent the cache as a whole and are not normally visible
+ to the netfs; the netfs gets a volume cookie to represent a collection of
+ files (typically something that a netfs would get for a superblock); and
+ data file cookies are used to cache data (something that would be got for
+ an inode).
+
+ * Volumes are matched using a key. This is a printable string that is used
+ to encode all the information that might be needed to distinguish one
+ superblock, say, from another. This would be a compound of things like
+ cell name or server address, volume name or share path. It must be a
+ valid pathname.
+
+ * Cookies are matched using a key. This is a binary blob and is used to
+ represent the object within a volume (so the volume key need not form
+ part of the blob). This might include things like an inode number and
+ uniquifier or a file handle.
+
+ * Cookie resources are set up and pinned by marking the cookie in-use.
+ This prevents the backing resources from being culled. Timed garbage
+ collection is employed to eliminate cookies that haven't been used for a
+ short while, thereby reducing resource overload. This is intended to be
+ used when a file is opened or closed.
+
+ A cookie can be marked in-use multiple times simultaneously; each mark
+ must be unused.
+
+ * Begin/end access functions are provided to delay cache withdrawal for the
+ duration of an operation and prevent structs from being freed whilst
+ we're looking at them.
+
+ * Data I/O is done by asynchronous DIO to/from a buffer described by the
+ netfs using an iov_iter.
+
+ * An invalidation facility is available to discard data from the cache and
+ to deal with I/O that's in progress that is accessing old data.
+
+ * Cookies can be "retired" upon release, thereby causing the object to be
+ removed from the cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/netfs-api.rst
+
+The cache backend API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/backend-api.rst
+
+
+Statistical Information
+=======================
+
+If FS-Cache is compiled with the following options enabled::
+
+ CONFIG_FSCACHE_STATS=y
+
+then it will gather certain statistics and display them through:
+
+ /proc/fs/fscache/stats
+
+This shows counts of a number of events that can happen in FS-Cache:
+
++--------------+-------+-------------------------------------------------------+
+|CLASS |EVENT |MEANING |
++==============+=======+=======================================================+
+|Cookies |n=N |Number of data storage cookies allocated |
++ +-------+-------------------------------------------------------+
+| |v=N |Number of volume index cookies allocated |
++ +-------+-------------------------------------------------------+
+| |vcol=N |Number of volume index key collisions |
++ +-------+-------------------------------------------------------+
+| |voom=N |Number of OOM events when allocating volume cookies |
++--------------+-------+-------------------------------------------------------+
+|Acquire |n=N |Number of acquire cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of acq reqs succeeded |
++ +-------+-------------------------------------------------------+
+| |oom=N |Number of acq reqs failed on ENOMEM |
++--------------+-------+-------------------------------------------------------+
+|LRU |n=N |Number of cookies currently on the LRU |
++ +-------+-------------------------------------------------------+
+| |exp=N |Number of cookies expired off of the LRU |
++ +-------+-------------------------------------------------------+
+| |rmv=N |Number of cookies removed from the LRU |
++ +-------+-------------------------------------------------------+
+| |drp=N |Number of LRU'd cookies relinquished/withdrawn |
++ +-------+-------------------------------------------------------+
+| |at=N |Time till next LRU cull (jiffies) |
++--------------+-------+-------------------------------------------------------+
+|Invals |n=N |Number of invalidations |
++--------------+-------+-------------------------------------------------------+
+|Updates |n=N |Number of update cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |rsz=N |Number of resize requests |
++ +-------+-------------------------------------------------------+
+| |rsn=N |Number of skipped resize requests |
++--------------+-------+-------------------------------------------------------+
+|Relinqs |n=N |Number of relinquish cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |rtr=N |Number of rlq reqs with retire=true |
++ +-------+-------------------------------------------------------+
+| |drop=N |Number of cookies no longer blocking re-acquisition |
++--------------+-------+-------------------------------------------------------+
+|NoSpace |nwr=N |Number of write requests refused due to lack of space |
++ +-------+-------------------------------------------------------+
+| |ncr=N |Number of create requests refused due to lack of space |
++ +-------+-------------------------------------------------------+
+| |cull=N |Number of objects culled to make space |
++--------------+-------+-------------------------------------------------------+
+|IO |rd=N |Number of read operations in the cache |
++ +-------+-------------------------------------------------------+
+| |wr=N |Number of write operations in the cache |
++--------------+-------+-------------------------------------------------------+
+
+Netfslib will also add some stats counters of its own.
+
+
+Cache List
+==========
+
+FS-Cache provides a list of cache cookies:
+
+ /proc/fs/fscache/caches
+
+This will look something like::
+
+ # cat /proc/fs/fscache/caches
+ CACHE REF VOLS OBJS ACCES S NAME
+ ======== ===== ===== ===== ===== = ===============
+ 00000001 2 1 2123 1 A default
+
+where the columns are:
+
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ CACHE Cache cookie debug ID (also appears in traces)
+ REF Number of references on the cache cookie
+ VOLS Number of volume cookies in this cache
+ OBJS Number of cache objects in use
+ ACCES Number of accesses pinning the cache
+ S State
+ NAME Name of the cache.
+ ======= ===============================================================
+
+The state can be (-) Inactive, (P)reparing, (A)ctive, (E)rror or (W)ithdrawing.
+
+
+Volume List
+===========
+
+FS-Cache provides a list of volume cookies:
+
+ /proc/fs/fscache/volumes
+
+This will look something like::
+
+ VOLUME REF nCOOK ACC FL CACHE KEY
+ ======== ===== ===== === == =============== ================
+ 00000001 55 54 1 00 default afs,example.com,100058
+
+where the columns are:
+
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ VOLUME The volume cookie debug ID (also appears in traces)
+ REF Number of references on the volume cookie
+ nCOOK Number of cookies in the volume
+ ACC Number of accesses pinning the cache
+ FL Flags on the volume cookie
+ CACHE Name of the cache or "-"
+ KEY The indexing key for the volume
+ ======= ===============================================================
+
+
+Cookie List
+===========
+
+FS-Cache provides a list of cookies:
+
+ /proc/fs/fscache/cookies
+
+This will look something like::
+
+ # head /proc/fs/fscache/cookies
+ COOKIE VOLUME REF ACT ACC S FL DEF
+ ======== ======== === === === = == ================
+ 00000435 00000001 1 0 -1 - 08 0000000201d080070000000000000000, 0000000000000000
+ 00000436 00000001 1 0 -1 - 00 0000005601d080080000000000000000, 0000000000000051
+ 00000437 00000001 1 0 -1 - 08 00023b3001d0823f0000000000000000, 0000000000000000
+ 00000438 00000001 1 0 -1 - 08 0000005801d0807b0000000000000000, 0000000000000000
+ 00000439 00000001 1 0 -1 - 08 00023b3201d080a10000000000000000, 0000000000000000
+ 0000043a 00000001 1 0 -1 - 08 00023b3401d080a30000000000000000, 0000000000000000
+ 0000043b 00000001 1 0 -1 - 08 00023b3601d080b30000000000000000, 0000000000000000
+ 0000043c 00000001 1 0 -1 - 08 00023b3801d080b40000000000000000, 0000000000000000
+
+where the columns are:
+
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ COOKIE The cookie debug ID (also appears in traces)
+ VOLUME The parent volume cookie debug ID
+ REF Number of references on the cookie
+ ACT Number of times the cookie is marked in use
+ ACC Number of access pins in the cookie
+ S State of the cookie
+ FL Flags on the cookie
+ DEF Key, auxiliary data
+ ======= ===============================================================
+
+
+Debugging
+=========
+
+If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
+debugging enabled by adjusting the value in::
+
+ /sys/module/fscache/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ ======= ======= =============================== =======================
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 Cache management Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+ 3 8 Cookie management Function entry trace
+ 4 16 Function exit trace
+ 5 32 General
+ 6-8 (Not used)
+ 9 512 I/O operation management Function entry trace
+ 10 1024 Function exit trace
+ 11 2048 General
+ ======= ======= =============================== =======================
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example::
+
+ echo $((1|8|512)) >/sys/module/fscache/parameters/debug
+
+will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
deleted file mode 100644
index 50f0a5757f48..000000000000
--- a/Documentation/filesystems/caching/fscache.txt
+++ /dev/null
@@ -1,448 +0,0 @@
- ==========================
- General Filesystem Caching
- ==========================
-
-========
-OVERVIEW
-========
-
-This facility is a general purpose cache for network filesystems, though it
-could be used for caching other things such as ISO9660 filesystems too.
-
-FS-Cache mediates between cache backends (such as CacheFS) and network
-filesystems:
-
- +---------+
- | | +--------------+
- | NFS |--+ | |
- | | | +-->| CacheFS |
- +---------+ | +----------+ | | /dev/hda5 |
- | | | | +--------------+
- +---------+ +-->| | |
- | | | |--+
- | AFS |----->| FS-Cache |
- | | | |--+
- +---------+ +-->| | |
- | | | | +--------------+
- +---------+ | +----------+ | | |
- | | | +-->| CacheFiles |
- | ISOFS |--+ | /var/cache |
- | | +--------------+
- +---------+
-
-Or to look at it another way, FS-Cache is a module that provides a caching
-facility to a network filesystem such that the cache is transparent to the
-user:
-
- +---------+
- | |
- | Server |
- | |
- +---------+
- | NETWORK
- ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- |
- | +----------+
- V | |
- +---------+ | |
- | | | |
- | NFS |----->| FS-Cache |
- | | | |--+
- +---------+ | | | +--------------+ +--------------+
- | | | | | | | |
- V +----------+ +-->| CacheFiles |-->| Ext3 |
- +---------+ | /var/cache | | /dev/sda6 |
- | | +--------------+ +--------------+
- | VFS | ^ ^
- | | | |
- +---------+ +--------------+ |
- | KERNEL SPACE | |
- ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
- | USER SPACE | |
- V | |
- +---------+ +--------------+
- | | | |
- | Process | | cachefilesd |
- | | | |
- +---------+ +--------------+
-
-
-FS-Cache does not follow the idea of completely loading every netfs file
-opened in its entirety into a cache before permitting it to be accessed and
-then serving the pages out of that cache rather than the netfs inode because:
-
- (1) It must be practical to operate without a cache.
-
- (2) The size of any accessible file must not be limited to the size of the
- cache.
-
- (3) The combined size of all opened files (this includes mapped libraries)
- must not be limited to the size of the cache.
-
- (4) The user should not be forced to download an entire file just to do a
- one-off access of a small portion of it (such as might be done with the
- "file" program).
-
-It instead serves the cache out in PAGE_SIZE chunks as and when requested by
-the netfs('s) using it.
-
-
-FS-Cache provides the following facilities:
-
- (1) More than one cache can be used at once. Caches can be selected
- explicitly by use of tags.
-
- (2) Caches can be added / removed at any time.
-
- (3) The netfs is provided with an interface that allows either party to
- withdraw caching facilities from a file (required for (2)).
-
- (4) The interface to the netfs returns as few errors as possible, preferring
- rather to let the netfs remain oblivious.
-
- (5) Cookies are used to represent indices, files and other objects to the
- netfs. The simplest cookie is just a NULL pointer - indicating nothing
- cached there.
-
- (6) The netfs is allowed to propose - dynamically - any index hierarchy it
- desires, though it must be aware that the index search function is
- recursive, stack space is limited, and indices can only be children of
- indices.
-
- (7) Data I/O is done direct to and from the netfs's pages. The netfs
- indicates that page A is at index B of the data-file represented by cookie
- C, and that it should be read or written. The cache backend may or may
- not start I/O on that page, but if it does, a netfs callback will be
- invoked to indicate completion. The I/O may be either synchronous or
- asynchronous.
-
- (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
- them as obsolete and the index hierarchy rooted at that point will get
- recycled.
-
- (9) The netfs provides a "match" function for index searches. In addition to
- saying whether a match was made or not, this can also specify that an
- entry should be updated or deleted.
-
-(10) As much as possible is done asynchronously.
-
-
-FS-Cache maintains a virtual indexing tree in which all indices, files, objects
-and pages are kept. Bits of this tree may actually reside in one or more
-caches.
-
- FSDEF
- |
- +------------------------------------+
- | |
- NFS AFS
- | |
- +--------------------------+ +-----------+
- | | | |
- homedir mirror afs.org redhat.com
- | | |
- +------------+ +---------------+ +----------+
- | | | | | |
- 00001 00002 00007 00125 vol00001 vol00002
- | | | | |
- +---+---+ +-----+ +---+ +------+------+ +-----+----+
- | | | | | | | | | | | | |
-PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
- | |
- PG0 +-------+
- | |
- 00001 00003
- |
- +---+---+
- | | |
- PG0 PG1 PG2
-
-In the example above, you can see two netfs's being backed: NFS and AFS. These
-have different index hierarchies:
-
- (*) The NFS primary index contains per-server indices. Each server index is
- indexed by NFS file handles to get data file objects. Each data file
- objects can have an array of pages, but may also have further child
- objects, such as extended attributes and directory entries. Extended
- attribute objects themselves have page-array contents.
-
- (*) The AFS primary index contains per-cell indices. Each cell index contains
- per-logical-volume indices. Each volume index contains up to three
- indices for the read-write, read-only and backup mirrors of those volumes.
- Each of these contains vnode data file objects, each of which contains an
- array of pages.
-
-The very top index is the FS-Cache master index in which individual netfs's
-have entries.
-
-Any index object may reside in more than one cache, provided it only has index
-children. Any index with non-index object children will be assumed to only
-reside in one cache.
-
-
-The netfs API to FS-Cache can be found in:
-
- Documentation/filesystems/caching/netfs-api.txt
-
-The cache backend API to FS-Cache can be found in:
-
- Documentation/filesystems/caching/backend-api.txt
-
-A description of the internal representations and object state machine can be
-found in:
-
- Documentation/filesystems/caching/object.txt
-
-
-=======================
-STATISTICAL INFORMATION
-=======================
-
-If FS-Cache is compiled with the following options enabled:
-
- CONFIG_FSCACHE_STATS=y
- CONFIG_FSCACHE_HISTOGRAM=y
-
-then it will gather certain statistics and display them through a number of
-proc files.
-
- (*) /proc/fs/fscache/stats
-
- This shows counts of a number of events that can happen in FS-Cache:
-
- CLASS EVENT MEANING
- ======= ======= =======================================================
- Cookies idx=N Number of index cookies allocated
- dat=N Number of data storage cookies allocated
- spc=N Number of special cookies allocated
- Objects alc=N Number of objects allocated
- nal=N Number of object allocation failures
- avl=N Number of objects that reached the available state
- ded=N Number of objects that reached the dead state
- ChkAux non=N Number of objects that didn't have a coherency check
- ok=N Number of objects that passed a coherency check
- upd=N Number of objects that needed a coherency data update
- obs=N Number of objects that were declared obsolete
- Pages mrk=N Number of pages marked as being cached
- unc=N Number of uncache page requests seen
- Acquire n=N Number of acquire cookie requests seen
- nul=N Number of acq reqs given a NULL parent
- noc=N Number of acq reqs rejected due to no cache available
- ok=N Number of acq reqs succeeded
- nbf=N Number of acq reqs rejected due to error
- oom=N Number of acq reqs failed on ENOMEM
- Lookups n=N Number of lookup calls made on cache backends
- neg=N Number of negative lookups made
- pos=N Number of positive lookups made
- crt=N Number of objects created by lookup
- tmo=N Number of lookups timed out and requeued
- Updates n=N Number of update cookie requests seen
- nul=N Number of upd reqs given a NULL parent
- run=N Number of upd reqs granted CPU time
- Relinqs n=N Number of relinquish cookie requests seen
- nul=N Number of rlq reqs given a NULL parent
- wcr=N Number of rlq reqs waited on completion of creation
- AttrChg n=N Number of attribute changed requests seen
- ok=N Number of attr changed requests queued
- nbf=N Number of attr changed rejected -ENOBUFS
- oom=N Number of attr changed failed -ENOMEM
- run=N Number of attr changed ops given CPU time
- Allocs n=N Number of allocation requests seen
- ok=N Number of successful alloc reqs
- wt=N Number of alloc reqs that waited on lookup completion
- nbf=N Number of alloc reqs rejected -ENOBUFS
- int=N Number of alloc reqs aborted -ERESTARTSYS
- ops=N Number of alloc reqs submitted
- owt=N Number of alloc reqs waited for CPU time
- abt=N Number of alloc reqs aborted due to object death
- Retrvls n=N Number of retrieval (read) requests seen
- ok=N Number of successful retr reqs
- wt=N Number of retr reqs that waited on lookup completion
- nod=N Number of retr reqs returned -ENODATA
- nbf=N Number of retr reqs rejected -ENOBUFS
- int=N Number of retr reqs aborted -ERESTARTSYS
- oom=N Number of retr reqs failed -ENOMEM
- ops=N Number of retr reqs submitted
- owt=N Number of retr reqs waited for CPU time
- abt=N Number of retr reqs aborted due to object death
- Stores n=N Number of storage (write) requests seen
- ok=N Number of successful store reqs
- agn=N Number of store reqs on a page already pending storage
- nbf=N Number of store reqs rejected -ENOBUFS
- oom=N Number of store reqs failed -ENOMEM
- ops=N Number of store reqs submitted
- run=N Number of store reqs granted CPU time
- pgs=N Number of pages given store req processing time
- rxd=N Number of store reqs deleted from tracking tree
- olm=N Number of store reqs over store limit
- VmScan nos=N Number of release reqs against pages with no pending store
- gon=N Number of release reqs against pages stored by time lock granted
- bsy=N Number of release reqs ignored due to in-progress store
- can=N Number of page stores cancelled due to release req
- Ops pend=N Number of times async ops added to pending queues
- run=N Number of times async ops given CPU time
- enq=N Number of times async ops queued for processing
- can=N Number of async ops cancelled
- rej=N Number of async ops rejected due to object lookup/create failure
- ini=N Number of async ops initialised
- dfr=N Number of async ops queued for deferred release
- rel=N Number of async ops released (should equal ini=N when idle)
- gc=N Number of deferred-release async ops garbage collected
- CacheOp alo=N Number of in-progress alloc_object() cache ops
- luo=N Number of in-progress lookup_object() cache ops
- luc=N Number of in-progress lookup_complete() cache ops
- gro=N Number of in-progress grab_object() cache ops
- upo=N Number of in-progress update_object() cache ops
- dro=N Number of in-progress drop_object() cache ops
- pto=N Number of in-progress put_object() cache ops
- syn=N Number of in-progress sync_cache() cache ops
- atc=N Number of in-progress attr_changed() cache ops
- rap=N Number of in-progress read_or_alloc_page() cache ops
- ras=N Number of in-progress read_or_alloc_pages() cache ops
- alp=N Number of in-progress allocate_page() cache ops
- als=N Number of in-progress allocate_pages() cache ops
- wrp=N Number of in-progress write_page() cache ops
- ucp=N Number of in-progress uncache_page() cache ops
- dsp=N Number of in-progress dissociate_pages() cache ops
- CacheEv nsp=N Number of object lookups/creations rejected due to lack of space
- stl=N Number of stale objects deleted
- rtr=N Number of objects retired when relinquished
- cul=N Number of objects culled
-
-
- (*) /proc/fs/fscache/histogram
-
- cat /proc/fs/fscache/histogram
- JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
- ===== ===== ========= ========= ========= ========= =========
-
- This shows the breakdown of the number of times each amount of time
- between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
- columns are as follows:
-
- COLUMN TIME MEASUREMENT
- ======= =======================================================
- OBJ INST Length of time to instantiate an object
- OP RUNS Length of time a call to process an operation took
- OBJ RUNS Length of time a call to process an object event took
- RETRV DLY Time between requesting a read and lookup completing
- RETRIEVLS Time between beginning and end of a retrieval
-
- Each row shows the number of events that took a particular range of times.
- Each step is 1 jiffy in size. The JIFS column indicates the particular
- jiffy range covered, and the SECS field the equivalent number of seconds.
-
-
-===========
-OBJECT LIST
-===========
-
-If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
-list of all the objects currently allocated and allow them to be viewed
-through:
-
- /proc/fs/fscache/objects
-
-This will look something like:
-
- [root@andromeda ~]# head /proc/fs/fscache/objects
- OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
- ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
- 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
- 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
-
-where the first set of columns before the '|' describe the object:
-
- COLUMN DESCRIPTION
- ======= ===============================================================
- OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
- PARENT Debugging ID of parent object
- STAT Object state
- CHLDN Number of child objects of this object
- OPS Number of outstanding operations on this object
- OOP Number of outstanding child object management operations
- IPR
- EX Number of outstanding exclusive operations
- READS Number of outstanding read operations
- EM Object's event mask
- EV Events raised on this object
- F Object flags
- S Object work item busy state mask (1:pending 2:running)
-
-and the second set of columns describe the object's cookie, if present:
-
- COLUMN DESCRIPTION
- =============== =======================================================
- NETFS_COOKIE_DEF Name of netfs cookie definition
- TY Cookie type (IX - index, DT - data, hex - special)
- FL Cookie flags
- NETFS_DATA Netfs private data stored in the cookie
- OBJECT_KEY Object key } 1 column, with separating comma
- AUX_DATA Object aux data } presence may be configured
-
- The data shown may be filtered by attaching a key to an appropriate keyring
-before viewing the file. Something like:
-
- keyctl add user fscache:objlist <restrictions> @s
-
-where <restrictions> are a selection of the following letters:
-
- K Show hexdump of object key (don't show if not given)
- A Show hexdump of object aux data (don't show if not given)
-
-and the following paired letters:
-
- C Show objects that have a cookie
- c Show objects that don't have a cookie
- B Show objects that are busy
- b Show objects that aren't busy
- W Show objects that have pending writes
- w Show objects that don't have pending writes
- R Show objects that have outstanding reads
- r Show objects that don't have outstanding reads
- S Show objects that have work queued
- s Show objects that don't have work queued
-
-If neither side of a letter pair is given, then both are implied. For example:
-
- keyctl add user fscache:objlist KB @s
-
-shows objects that are busy, and lists their object keys, but does not dump
-their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
-not implied.
-
-By default all objects and all fields will be shown.
-
-
-=========
-DEBUGGING
-=========
-
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in:
-
- /sys/module/fscache/parameters/debug
-
-This is a bitmask of debugging streams to enable:
-
- BIT VALUE STREAM POINT
- ======= ======= =============================== =======================
- 0 1 Cache management Function entry trace
- 1 2 Function exit trace
- 2 4 General
- 3 8 Cookie management Function entry trace
- 4 16 Function exit trace
- 5 32 General
- 6 64 Page handling Function entry trace
- 7 128 Function exit trace
- 8 256 General
- 9 512 Operation management Function entry trace
- 10 1024 Function exit trace
- 11 2048 General
-
-The appropriate set of values should be OR'd together and the result written to
-the control file. For example:
-
- echo $((1|8|64)) >/sys/module/fscache/parameters/debug
-
-will turn on function entry debugging for the cache, cookie and page handling
-streams.
diff --git a/Documentation/filesystems/caching/index.rst b/Documentation/filesystems/caching/index.rst
new file mode 100644
index 000000000000..df4307124b00
--- /dev/null
+++ b/Documentation/filesystems/caching/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Filesystem Caching
+==================
+
+.. toctree::
+ :maxdepth: 2
+
+ fscache
+ netfs-api
+ backend-api
+ cachefiles
diff --git a/Documentation/filesystems/caching/netfs-api.rst b/Documentation/filesystems/caching/netfs-api.rst
new file mode 100644
index 000000000000..665b27f1556e
--- /dev/null
+++ b/Documentation/filesystems/caching/netfs-api.rst
@@ -0,0 +1,452 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Network Filesystem Caching API
+==============================
+
+Fscache provides an API by which a network filesystem can make use of local
+caching facilities. The API is arranged around a number of principles:
+
+ (1) A cache is logically organised into volumes and data storage objects
+ within those volumes.
+
+ (2) Volumes and data storage objects are represented by various types of
+ cookie.
+
+ (3) Cookies have keys that distinguish them from their peers.
+
+ (4) Cookies have coherency data that allows a cache to determine if the
+ cached data is still valid.
+
+ (5) I/O is done asynchronously where possible.
+
+This API is used by::
+
+    #include <linux/fscache.h>
+
+.. This document contains the following sections:
+
+ (1) Overview
+ (2) Volume registration
+ (3) Data file registration
+ (4) Declaring a cookie to be in use
+ (5) Resizing a data file (truncation)
+ (6) Data I/O API
+ (7) Data file coherency
+ (8) Data file invalidation
+ (9) Write back resource management
+ (10) Caching of local modifications
+ (11) Page release and invalidation
+
+
+Overview
+========
+
+The fscache hierarchy is organised on two levels from a network filesystem's
+point of view. The upper level represents "volumes" and the lower level
+represents "data storage objects". These are represented by two types of
+cookie, hereafter referred to as "volume cookies" and "cookies".
+
+A network filesystem acquires a volume cookie for a volume using a volume key,
+which represents all the information that defines that volume (e.g. cell name
+or server address, volume ID or share name). This must be rendered as a
+printable string that can be used as a directory name (ie. no '/' characters
+and shouldn't begin with a '.'). The maximum name length is one less than the
+maximum size of a filename component (allowing the cache backend one char for
+its own purposes).
+
+A filesystem would typically have a volume cookie for each superblock.
+
+The filesystem then acquires a cookie for each file within that volume using an
+object key. Object keys are binary blobs and only need to be unique within
+their parent volume. The cache backend is responsible for rendering the binary
+blob into something it can use and may employ hash tables, trees or whatever to
+improve its ability to find an object. This is transparent to the network
+filesystem.
+
+A filesystem would typically have a cookie for each inode, and would acquire
+it in iget and relinquish it when evicting the inode.
+
+Once it has a cookie, the filesystem needs to mark the cookie as being in use.
+This causes fscache to send the cache backend off to look up/create resources
+for the cookie in the background, to check its coherency and, if necessary, to
+mark the object as being under modification.
+
+A filesystem would typically "use" the cookie in its file open routine and
+unuse it in file release and it needs to use the cookie around calls to
+truncate the cookie locally. It *also* needs to use the cookie when the
+pagecache becomes dirty and unuse it when writeback is complete. This is
+slightly tricky, and provision is made for it.
+
+When performing a read, write or resize on a cookie, the filesystem must first
+begin an operation. This copies the resources into a holding struct and puts
+extra pins into the cache to stop cache withdrawal from tearing down the
+structures being used. The actual operation can then be issued and conflicting
+invalidations can be detected upon completion.
+
+The filesystem is expected to use netfslib to access the cache, but that's not
+actually required and it can use the fscache I/O API directly.
+
+
+Volume Registration
+===================
+
+The first step for a network filesystem is to acquire a volume cookie for the
+volume it wants to access::
+
+ struct fscache_volume *
+ fscache_acquire_volume(const char *volume_key,
+ const char *cache_name,
+ const void *coherency_data,
+ size_t coherency_len);
+
+This function creates a volume cookie with the specified volume key as its name
+and notes the coherency data.
+
+The volume key must be a printable string with no '/' characters in it. It
+should begin with the name of the filesystem and should be no longer than 254
+characters. It should uniquely represent the volume and will be matched with
+what's stored in the cache.
+
+The caller may also specify the name of the cache to use. If specified,
+fscache will look up or create a cache cookie of that name and will use a cache
+of that name if it is online or comes online. If no cache name is specified,
+it will use the first cache that comes to hand and set the name to that.
+
+The specified coherency data is stored in the cookie and will be matched
+against coherency data stored on disk. The data pointer may be NULL if no data
+is provided. If the coherency data doesn't match, the entire cache volume will
+be invalidated.
+
+This function can return errors such as EBUSY if the volume key is already in
+use by an acquired volume or ENOMEM if an allocation failure occurred. It may
+also return a NULL volume cookie if fscache is not enabled. It is safe to
+pass a NULL cookie to any function that takes a volume cookie. This will
+cause that function to do nothing.
+
+
+When the network filesystem has finished with a volume, it should relinquish it
+by calling::
+
+ void fscache_relinquish_volume(struct fscache_volume *volume,
+ const void *coherency_data,
+ bool invalidate);
+
+This will cause the volume to be committed or removed and, if sealed, the
+coherency data will be set to the value supplied. The amount of coherency data
+must match the length specified when the volume was acquired. Note that all
+data cookies obtained in this volume must be relinquished before the volume is
+relinquished.
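+
+As an illustration only, a mount/unmount pairing might look something like the
+sketch below; the ``myfs`` naming, the superblock fields and the key layout
+are all hypothetical, and errors are assumed to come back as ERR_PTR values in
+line with the error behaviour described above::
+
+    /* Acquire a volume cookie at mount time.  Illustrative sketch only. */
+    static int myfs_setup_volume(struct myfs_super *sb)
+    {
+        char *key;
+
+        /* The volume key must be printable, '/'-free and unique. */
+        key = kasprintf(GFP_KERNEL, "myfs,%s,%s", sb->cell, sb->volname);
+        if (!key)
+            return -ENOMEM;
+
+        sb->volume = fscache_acquire_volume(key, NULL /* any cache */,
+                                            NULL, 0 /* no coherency data */);
+        kfree(key);
+        if (IS_ERR(sb->volume)) {
+            int err = PTR_ERR(sb->volume);
+
+            sb->volume = NULL;
+            return err;
+        }
+        return 0;
+    }
+
+    /* ...and hand it back at unmount time. */
+    static void myfs_free_volume(struct myfs_super *sb)
+    {
+        fscache_relinquish_volume(sb->volume, NULL /* keep coherency data */,
+                                  false /* don't invalidate */);
+        sb->volume = NULL;
+    }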
+
+
+Data File Registration
+======================
+
+Once it has a volume cookie, a network filesystem can use it to acquire a
+cookie for data storage::
+
+ struct fscache_cookie *
+ fscache_acquire_cookie(struct fscache_volume *volume,
+ u8 advice,
+ const void *index_key,
+ size_t index_key_len,
+ const void *aux_data,
+ size_t aux_data_len,
+                           loff_t object_size);
+
+This creates the cookie in the volume using the specified index key. The index
+key is a binary blob of the given length and must be unique for the volume.
+This is saved into the cookie. There are no restrictions on the content, but
+its length shouldn't exceed about three quarters of the maximum filename length
+to allow for encoding.
+
+The caller should also pass in a piece of coherency data in aux_data. A buffer
+of size aux_data_len will be allocated and the coherency data copied in. It is
+assumed that the size is invariant over time. The coherency data is used to
+check the validity of data in the cache. Functions are provided by which the
+coherency data can be updated.
+
+The file size of the object being cached should also be provided. This may be
+used to trim the data and will be stored with the coherency data.
+
+This function never returns an error, though it may return a NULL cookie on
+allocation failure or if fscache is not enabled. It is safe to pass in a NULL
+volume cookie and pass the NULL cookie returned to any function that takes it.
+This will cause that function to do nothing.
+
+
+When the network filesystem has finished with a cookie, it should relinquish it
+by calling::
+
+ void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+ bool retire);
+
+This will cause fscache to either commit the storage backing the cookie or
+delete it.
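+
+By way of illustration, tying the cookie to an inode might look like this
+sketch, where the ``myfs_inode`` layout, the file ID used as the index key and
+the change counter used as coherency data are all hypothetical::
+
+    /* Acquire a data-file cookie when the inode is instantiated. */
+    static void myfs_get_cookie(struct myfs_inode *mi)
+    {
+        mi->cookie = fscache_acquire_cookie(mi->sb->volume,
+                                            0 /* advice */,
+                                            &mi->fid, sizeof(mi->fid),
+                                            &mi->change_id,
+                                            sizeof(mi->change_id),
+                                            i_size_read(&mi->vfs_inode));
+    }
+
+    /* Relinquish it on inode eviction; retire == true would discard the
+     * cached data as well.
+     */
+    static void myfs_put_cookie(struct myfs_inode *mi)
+    {
+        fscache_relinquish_cookie(mi->cookie, false);
+        mi->cookie = NULL;
+    }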
+
+
+Marking A Cookie In-Use
+=======================
+
+Once a cookie has been acquired by a network filesystem, the filesystem should
+tell fscache when it intends to use the cookie (typically done on file open)
+and should say when it has finished with it (typically on file close)::
+
+ void fscache_use_cookie(struct fscache_cookie *cookie,
+ bool will_modify);
+ void fscache_unuse_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
+ const loff_t *object_size);
+
+The *use* function tells fscache that it will use the cookie and, additionally,
+indicates whether the user intends to modify the contents locally. If not yet
+done, this will trigger the cache backend to go and gather the resources it
+needs to access/store data in the cache. This is done in the background, and
+so may not be complete by the time the function returns.
+
+The *unuse* function indicates that a filesystem has finished using a cookie.
+It optionally updates the stored coherency data and object size and then
+decreases the in-use counter. When the last user unuses the cookie, it is
+scheduled for garbage collection. If not reused within a short time, the
+resources will be released to reduce system resource consumption.
+
+A cookie must be marked in-use before it can be accessed for read, write or
+resize - and an in-use mark must be kept whilst there is dirty data in the
+pagecache in order to avoid an oops due to trying to open a file during process
+exit.
+
+Note that in-use marks are cumulative. For each time a cookie is marked
+in-use, it must be unused.
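+
+For illustration, a minimal open/release pairing might look like this (the
+``myfs_i()`` helper and the inode layout are hypothetical)::
+
+    static int myfs_open(struct inode *inode, struct file *file)
+    {
+        struct myfs_inode *mi = myfs_i(inode);
+
+        /* Say whether we might modify the cache contents locally. */
+        fscache_use_cookie(mi->cookie, file->f_mode & FMODE_WRITE);
+        return 0;
+    }
+
+    static int myfs_release(struct inode *inode, struct file *file)
+    {
+        struct myfs_inode *mi = myfs_i(inode);
+        loff_t size = i_size_read(inode);
+
+        /* Hand back updated coherency data and the current file size. */
+        fscache_unuse_cookie(mi->cookie, &mi->change_id, &size);
+        return 0;
+    }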
+
+
+Resizing A Data File (Truncation)
+=================================
+
+If a network filesystem file is resized locally by truncation, the following
+should be called to notify the cache::
+
+ void fscache_resize_cookie(struct fscache_cookie *cookie,
+ loff_t new_size);
+
+The caller must have first marked the cookie in-use. The cookie and the new
+size are passed in and the cache is synchronously resized. This is expected to
+be called from the ``->setattr()`` inode operation under the inode lock.
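+
+A minimal sketch of the relevant ``->setattr()`` fragment, assuming the
+hypothetical ``myfs_i()`` helper and that the cookie has already been marked
+in-use::
+
+    if (attr->ia_valid & ATTR_SIZE) {
+        truncate_setsize(inode, attr->ia_size);
+        /* The inode lock is held here; the cache resizes synchronously. */
+        fscache_resize_cookie(myfs_i(inode)->cookie, attr->ia_size);
+    }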
+
+
+Data I/O API
+============
+
+To do data I/O operations directly through a cookie, the following functions
+are available::
+
+ int fscache_begin_read_operation(struct netfs_cache_resources *cres,
+ struct fscache_cookie *cookie);
+ int fscache_read(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ enum netfs_read_from_hole read_hole,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+ int fscache_write(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+
+The *begin* function sets up an operation, attaching the resources required to
+the cache resources block from the cookie. Assuming it doesn't return an error
+(for instance, it will return -ENOBUFS if given a NULL cookie, but otherwise do
+nothing), then one of the other two functions can be issued.
+
+The *read* and *write* functions initiate a direct-IO operation. Both take the
+previously set up cache resources block, an indication of the starting file
+position, and an I/O iterator that describes the buffer and indicates the
+amount of data.
+
+The read function also takes a parameter to indicate how it should handle a
+partially populated region (a hole) in the disk content. This may be to ignore
+it, skip over an initial hole and place zeros in the buffer or give an error.
+
+The read and write functions can be given an optional termination function that
+will be run on completion::
+
+ typedef
+ void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error,
+ bool was_async);
+
+If a termination function is given, the operation will be run asynchronously
+and the termination function will be called upon completion. If not given, the
+operation will be run synchronously. Note that in the asynchronous case, it is
+possible for the operation to complete before the function returns.
+
+Both the read and write functions end the operation when they complete,
+detaching any pinned resources.
+
+The read operation will fail with ESTALE if invalidation occurred whilst the
+operation was ongoing.
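+
+As an illustration, a synchronous read might be wrapped as below; the
+``myfs_read_from_cache()`` name is hypothetical and the
+``NETFS_READ_HOLE_FAIL`` hole policy is just one possible choice::
+
+    /* Read a span from the cache, blocking until it completes.  With no
+     * termination function, fscache_read() runs synchronously and ends the
+     * operation itself, detaching the pinned resources.
+     */
+    static int myfs_read_from_cache(struct fscache_cookie *cookie,
+                                    struct iov_iter *iter, loff_t pos)
+    {
+        struct netfs_cache_resources cres = {};
+        int ret;
+
+        ret = fscache_begin_read_operation(&cres, cookie);
+        if (ret < 0)
+            return ret; /* e.g. -ENOBUFS if given a NULL cookie */
+
+        /* Fail with an error rather than reading through a hole. */
+        return fscache_read(&cres, pos, iter, NETFS_READ_HOLE_FAIL,
+                            NULL, NULL);
+    }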
+
+
+Data File Coherency
+===================
+
+To request an update of the coherency data and file size on a cookie, the
+following should be called::
+
+ void fscache_update_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
+ const loff_t *object_size);
+
+This will update the cookie's coherency data and/or file size.
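+
+For illustration, after a server round trip a filesystem might refresh the
+stored information like this (the ``mi`` inode record and the ``status`` reply
+are hypothetical)::
+
+    loff_t object_size = status->size;
+
+    /* Note the server's new change counter and pass both on. */
+    mi->change_id = status->change_id;
+    fscache_update_cookie(mi->cookie, &mi->change_id, &object_size);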
+
+
+Data File Invalidation
+======================
+
+Sometimes it will be necessary to invalidate an object that contains data.
+Typically this will be necessary when the server informs the network filesystem
+of a remote third-party change - at which point the filesystem has to throw
+away the state and cached data that it had for a file and reload from the
+server.
+
+To indicate that a cache object should be invalidated, the following should be
+called::
+
+ void fscache_invalidate(struct fscache_cookie *cookie,
+ const void *aux_data,
+ loff_t size,
+ unsigned int flags);
+
+This increases the invalidation counter in the cookie to cause outstanding
+reads to fail with -ESTALE, sets the coherency data and file size from the
+information supplied, blocks new I/O on the cookie and dispatches the cache to
+go and get rid of the old data.
+
+Invalidation runs asynchronously in a worker thread so that it doesn't block
+too much.
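+
+A minimal sketch, again assuming a hypothetical ``mi`` inode record and a
+server notification carrying the new ``status``::
+
+    /* A third party changed the file: junk the cached data and install
+     * the new coherency data and file size.
+     */
+    mi->change_id = status->change_id;
+    fscache_invalidate(mi->cookie, &mi->change_id, status->size, 0);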
+
+
+Write-Back Resource Management
+==============================
+
+To write data to the cache from network filesystem writeback, the cache
+resources required need to be pinned at the point the modification is made (for
+instance when the page is marked dirty) as it's not possible to open a file in
+a thread that's exiting.
+
+The following facilities are provided to manage this:
+
+ * An inode flag, ``I_PINNING_FSCACHE_WB``, is provided to indicate that an
+   in-use mark is held on the cookie for this inode. It can only be changed
+   if the inode lock is held.
+
+ * A flag, ``unpinned_fscache_wb``, is placed in the ``writeback_control``
+   struct that gets set if ``__writeback_single_inode()`` clears
+   ``I_PINNING_FSCACHE_WB`` because all the dirty pages were cleared.
+
+To support this, the following functions are provided::
+
+ bool fscache_dirty_folio(struct address_space *mapping,
+ struct folio *folio,
+ struct fscache_cookie *cookie);
+ void fscache_unpin_writeback(struct writeback_control *wbc,
+ struct fscache_cookie *cookie);
+ void fscache_clear_inode_writeback(struct fscache_cookie *cookie,
+ struct inode *inode,
+ const void *aux);
+
+The *set* function is intended to be called from the filesystem's
+``dirty_folio`` address space operation. If ``I_PINNING_FSCACHE_WB`` is not
+set, it sets that flag and increments the use count on the cookie (the caller
+must already have called ``fscache_use_cookie()``).
+
+The *unpin* function is intended to be called from the filesystem's
+``write_inode`` superblock operation. It cleans up after writing by unusing the
+cookie if ``unpinned_fscache_wb`` is set in the ``writeback_control`` struct.
+
+The *clear* function is intended to be called from the netfs's ``evict_inode``
+superblock operation. It must be called *after*
+``truncate_inode_pages_final()``, but *before* ``clear_inode()``. This cleans
+up any hanging ``I_PINNING_FSCACHE_WB``. It also allows the coherency data to
+be updated.
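+
+For illustration, the three helpers might be wired into the usual VFS
+operations as below (``myfs_i()`` and the cookie field are hypothetical)::
+
+    static bool myfs_dirty_folio(struct address_space *mapping,
+                                 struct folio *folio)
+    {
+        return fscache_dirty_folio(mapping, folio,
+                                   myfs_i(mapping->host)->cookie);
+    }
+
+    static int myfs_write_inode(struct inode *inode,
+                                struct writeback_control *wbc)
+    {
+        fscache_unpin_writeback(wbc, myfs_i(inode)->cookie);
+        return 0;
+    }
+
+    static void myfs_evict_inode(struct inode *inode)
+    {
+        truncate_inode_pages_final(&inode->i_data);
+        /* Clears any hanging I_PINNING_FSCACHE_WB; no aux update here. */
+        fscache_clear_inode_writeback(myfs_i(inode)->cookie, inode, NULL);
+        clear_inode(inode);
+    }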
+
+
+Caching of Local Modifications
+==============================
+
+If a network filesystem has locally modified data that it wants to write to the
+cache, it needs to mark the pages to indicate that a write is in progress, and
+if the mark is already present, it needs to wait for it to be removed first
+(presumably due to an already in-progress operation). This prevents multiple
+competing DIO writes to the same storage in the cache.
+
+Firstly, the netfs should determine if caching is available by doing something
+like::
+
+ bool caching = fscache_cookie_enabled(cookie);
+
+If caching is to be attempted, pages should be waited for and then marked using
+the following functions provided by the netfs helper library::
+
+ void set_page_fscache(struct page *page);
+ void wait_on_page_fscache(struct page *page);
+ int wait_on_page_fscache_killable(struct page *page);
+
+Once all the pages in the span are marked, the netfs can ask fscache to
+schedule a write of that region::
+
+ void fscache_write_to_cache(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ loff_t start, size_t len, loff_t i_size,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv,
+                                bool caching);
+
+And if an error occurs before that point is reached, the marks can be removed
+by calling::
+
+ void fscache_clear_page_bits(struct address_space *mapping,
+ loff_t start, size_t len,
+                                 bool caching);
+
+In these functions, a pointer to the mapping to which the source pages are
+attached is passed in, and start and len indicate the extent of the region
+that's going to be written (the region doesn't have to be aligned to page
+boundaries, but it does have to be aligned to DIO boundaries on the backing
+filesystem). The caching parameter indicates whether caching should be
+performed: if it is false, the functions do nothing.
+
+The write function takes some additional parameters: the cookie representing
+the cache object to be written to, i_size indicates the size of the netfs file
+and term_func indicates an optional completion function, to which
+term_func_priv will be passed, along with the error or amount written.
+
+Note that the write function will always run asynchronously and will unmark all
+the pages upon completion before calling term_func.
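+
+Putting the pieces together, a write-back of a span might look like the
+following sketch, where ``myfs_write_done()`` is a hypothetical completion
+handler matching ``netfs_io_terminated_t``::
+
+    static void myfs_write_done(void *priv, ssize_t transferred_or_error,
+                                bool was_async)
+    {
+        if (transferred_or_error < 0)
+            pr_warn("myfs: cache write failed: %zd\n",
+                    transferred_or_error);
+    }
+
+    static void myfs_write_span(struct fscache_cookie *cookie,
+                                struct address_space *mapping,
+                                struct page *page, loff_t start, size_t len)
+    {
+        bool caching = fscache_cookie_enabled(cookie);
+
+        if (caching) {
+            /* Wait out any DIO write already underway, then mark ours. */
+            wait_on_page_fscache(page);
+            set_page_fscache(page);
+        }
+
+        /* Runs asynchronously; unmarks the pages and then calls
+         * myfs_write_done().  Does nothing if caching is false.
+         */
+        fscache_write_to_cache(cookie, mapping, start, len,
+                               i_size_read(mapping->host),
+                               myfs_write_done, NULL, caching);
+    }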
+
+
+Page Release and Invalidation
+=============================
+
+Fscache keeps track of whether we have any data in the cache yet for a cache
+object we've just created. It knows it doesn't have to do any reading until it
+has done a write and the page it wrote from has then been released by the VM,
+after which it *has* to look in the cache.
+
+To inform fscache that a page might now be in the cache, the following function
+should be called from the ``release_folio`` address space op::
+
+ void fscache_note_page_release(struct fscache_cookie *cookie);
+
+if the page has been released (ie. release_folio returned true).
+
+Page release and page invalidation should also wait for any mark left on the
+page to say that a DIO write is underway from that page::
+
+ void wait_on_page_fscache(struct page *page);
+ int wait_on_page_fscache_killable(struct page *page);
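+
+For illustration, a ``release_folio`` implementation built from these pieces
+might look like the sketch below; the ``myfs_i()`` helper is hypothetical and
+the gfp checks shown are one plausible policy for deciding whether it is safe
+to wait::
+
+    static bool myfs_release_folio(struct folio *folio, gfp_t gfp)
+    {
+        struct page *page = &folio->page;
+
+        if (PageFsCache(page)) {
+            /* Don't sleep in contexts that can't take it. */
+            if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS))
+                return false;
+            wait_on_page_fscache(page);
+        }
+
+        fscache_note_page_release(myfs_i(folio->mapping->host)->cookie);
+        return true;
+    }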
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/fscache.h
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
deleted file mode 100644
index ba968e8f5704..000000000000
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ /dev/null
@@ -1,910 +0,0 @@
- ===============================
- FS-CACHE NETWORK FILESYSTEM API
- ===============================
-
-There's an API by which a network filesystem can make use of the FS-Cache
-facilities. This is based around a number of principles:
-
- (1) Caches can store a number of different object types. There are two main
- object types: indices and files. The first is a special type used by
- FS-Cache to make finding objects faster and to make retiring of groups of
- objects easier.
-
- (2) Every index, file or other object is represented by a cookie. This cookie
- may or may not have anything associated with it, but the netfs doesn't
- need to care.
-
- (3) Barring the top-level index (one entry per cached netfs), the index
- hierarchy for each netfs is structured according to the whim of the netfs.
-
-This API is declared in <linux/fscache.h>.
-
-This document contains the following sections:
-
- (1) Network filesystem definition
- (2) Index definition
- (3) Object definition
- (4) Network filesystem (un)registration
- (5) Cache tag lookup
- (6) Index registration
- (7) Data file registration
- (8) Miscellaneous object registration
- (9) Setting the data file size
- (10) Page alloc/read/write
- (11) Page uncaching
- (12) Index and data file consistency
- (13) Cookie enablement
- (14) Miscellaneous cookie operations
- (15) Cookie unregistration
- (16) Index invalidation
- (17) Data file invalidation
- (18) FS-Cache specific page flags.
-
-
-=============================
-NETWORK FILESYSTEM DEFINITION
-=============================
-
-FS-Cache needs a description of the network filesystem. This is specified
-using a record of the following structure:
-
- struct fscache_netfs {
- uint32_t version;
- const char *name;
- struct fscache_cookie *primary_index;
- ...
- };
-
-The first two fields should be filled in before registration, and the third
-will be filled in by the registration function; any other fields should just be
-ignored and are for internal use only.
-
-The fields are:
-
- (1) The name of the netfs (used as the key in the toplevel index).
-
- (2) The version of the netfs (if the name matches but the version doesn't, the
- entire in-cache hierarchy for this netfs will be scrapped and begun
- afresh).
-
- (3) The cookie representing the primary index will be allocated according to
- another parameter passed into the registration function.
-
-For example, kAFS (linux/fs/afs/) uses the following definitions to describe
-itself:
-
- struct fscache_netfs afs_cache_netfs = {
- .version = 0,
- .name = "afs",
- };
-
-
-================
-INDEX DEFINITION
-================
-
-Indices are used for two purposes:
-
- (1) To aid the finding of a file based on a series of keys (such as AFS's
- "cell", "volume ID", "vnode ID").
-
- (2) To make it easier to discard a subset of all the files cached based around
- a particular key - for instance to mirror the removal of an AFS volume.
-
-However, since it's unlikely that any two netfs's are going to want to define
-their index hierarchies in quite the same way, FS-Cache tries to impose as few
-restraints as possible on how an index is structured and where it is placed in
-the tree. The netfs can even mix indices and data files at the same level, but
-it's not recommended.
-
-Each index entry consists of a key of indeterminate length plus some auxiliary
-data, also of indeterminate length.
-
-There are some limits on indices:
-
- (1) Any index containing non-index objects should be restricted to a single
- cache. Any such objects created within an index will be created in the
- first cache only. The cache in which an index is created can be
- controlled by cache tags (see below).
-
- (2) The entry data must be atomically journallable, so it is limited to about
- 400 bytes at present. At least 400 bytes will be available.
-
- (3) The depth of the index tree should be judged with care as the search
- function is recursive. Too many layers will run the kernel out of stack.
-
-
-=================
-OBJECT DEFINITION
-=================
-
-To define an object, a structure of the following type should be filled out:
-
- struct fscache_cookie_def
- {
- uint8_t name[16];
- uint8_t type;
-
- struct fscache_cache_tag *(*select_cache)(
- const void *parent_netfs_data,
- const void *cookie_netfs_data);
-
- enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
- const void *data,
- uint16_t datalen,
- loff_t object_size);
-
- void (*get_context)(void *cookie_netfs_data, void *context);
-
- void (*put_context)(void *cookie_netfs_data, void *context);
-
- void (*mark_pages_cached)(void *cookie_netfs_data,
- struct address_space *mapping,
- struct pagevec *cached_pvec);
- };
-
-This has the following fields:
-
- (1) The type of the object [mandatory].
-
- This is one of the following values:
-
- (*) FSCACHE_COOKIE_TYPE_INDEX
-
- This defines an index, which is a special FS-Cache type.
-
- (*) FSCACHE_COOKIE_TYPE_DATAFILE
-
- This defines an ordinary data file.
-
- (*) Any other value between 2 and 255
-
- This defines an extraordinary object such as an XATTR.
-
- (2) The name of the object type (NUL terminated unless all 16 chars are used)
- [optional].
-
- (3) A function to select the cache in which to store an index [optional].
-
- This function is invoked when an index needs to be instantiated in a cache
- during the instantiation of a non-index object. Only the immediate index
- parent for the non-index object will be queried. Any indices above that
- in the hierarchy may be stored in multiple caches. This function does not
- need to be supplied for any non-index object or any index that will only
- have index children.
-
- If this function is not supplied or if it returns NULL then the first
- cache in the parent's list will be chosen, or failing that, the first
- cache in the master list.
-
- (4) A function to check the auxiliary data [optional].
-
- This function will be called to check that a match found in the cache for
- this object is valid. For instance with AFS it could check the auxiliary
- data against the data version number returned by the server to determine
- whether the index entry in a cache is still valid.
-
- If this function is absent, it will be assumed that matching objects in a
- cache are always valid.
-
- The function is also passed the cache's idea of the object size and may
- use this to manage coherency also.
-
- If present, the function should return one of the following values:
-
- (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
- (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
- (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
-
- This function can also be used to extract data from the auxiliary data in
- the cache and copy it into the netfs's structures.
-
- (5) A pair of functions to manage contexts for the completion callback
- [optional].
-
- The cache read/write functions are passed a context which is then passed
- to the I/O completion callback function. To ensure this context remains
- valid until after the I/O completion is called, two functions may be
- provided: one to get an extra reference on the context, and one to drop a
- reference to it.
-
- If the context is not used or is a type of object that won't go out of
- scope, then these functions are not required. These functions are not
- required for indices as indices may not contain data. These functions may
- be called in interrupt context and so may not sleep.
-
- (6) A function to mark a page as retaining cache metadata [optional].
-
- This is called by the cache to indicate that it is retaining in-memory
- information for this page and that the netfs should uncache the page when
- it has finished. This does not indicate whether there's data on the disk
- or not. Note that several pages at once may be presented for marking.
-
- The PG_fscache bit is set on the pages before this function would be
- called, so the function need not be provided if this is sufficient.
-
- This function is not required for indices as they're not permitted to hold
- data.
-
- (7) A function to unmark all the pages retaining cache metadata [mandatory].
-
- This is called by FS-Cache to indicate that a backing store is being
- unbound from a cookie and that all the marks on the pages should be
- cleared to prevent confusion. Note that the cache will have torn down all
- its tracking information so that the pages don't need to be explicitly
- uncached.
-
- This function is not required for indices as they're not permitted to hold
- data.
-
-
-===================================
-NETWORK FILESYSTEM (UN)REGISTRATION
-===================================
-
-The first step is to declare the network filesystem to the cache. This also
-involves specifying the layout of the primary index (for AFS, this would be the
-"cell" level).
-
-The registration function is:
-
- int fscache_register_netfs(struct fscache_netfs *netfs);
-
-It just takes a pointer to the netfs definition. It returns 0 or an error as
-appropriate.
-
-For kAFS, registration is done as follows:
-
- ret = fscache_register_netfs(&afs_cache_netfs);
-
-The last step is, of course, unregistration:
-
- void fscache_unregister_netfs(struct fscache_netfs *netfs);
-
-
-================
-CACHE TAG LOOKUP
-================
-
-FS-Cache permits the use of more than one cache. To permit particular index
-subtrees to be bound to particular caches, the second step is to look up cache
-representation tags. This step is optional; it can be left entirely up to
-FS-Cache as to which cache should be used. The problem with doing that is that
-FS-Cache will always pick the first cache that was registered.
-
-To get the representation for a named tag:
-
- struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
-
-This takes a text string as the name and returns a representation of a tag. It
-will never return an error. It may return a dummy tag, however, if it runs out
-of memory; this will inhibit caching with this tag.
-
-Any representation so obtained must be released by passing it to this function:
-
- void fscache_release_cache_tag(struct fscache_cache_tag *tag);
-
-The tag will be retrieved by FS-Cache when it calls the object definition
-operation select_cache().
-
-
-==================
-INDEX REGISTRATION
-==================
-
-The third step is to inform FS-Cache about part of an index hierarchy that can
-be used to locate files. This is done by requesting a cookie for each index in
-the path to the file:
-
- struct fscache_cookie *
- fscache_acquire_cookie(struct fscache_cookie *parent,
- const struct fscache_object_def *def,
- const void *index_key,
- size_t index_key_len,
- const void *aux_data,
- size_t aux_data_len,
- void *netfs_data,
- loff_t object_size,
- bool enable);
-
-This function creates an index entry in the index represented by parent,
-filling in the index entry by calling the operations pointed to by def.
-
-A unique key that represents the object within the parent must be pointed to by
-index_key and is of length index_key_len.
-
-An optional blob of auxiliary data that is to be stored within the cache can be
-pointed to with aux_data and should be of length aux_data_len. This would
-typically be used for storing coherency data.
-
-The netfs may pass an arbitrary value in netfs_data and this will be presented
-to it in the event of any calling back. This may also be used in tracing or
-logging of messages.
-
-The cache tracks the size of the data attached to an object and this is set
-from object_size. For indices, this should be 0. This value will be passed to
-the ->check_aux() callback.
-
-Note that this function never returns an error - all errors are handled
-internally. It may, however, return NULL to indicate no cookie. It is quite
-acceptable to pass this token back to this function as the parent to another
-acquisition (or even to the relinquish cookie, read page and write page
-functions - see below).
-
-Note also that no indices are actually created in a cache until a non-index
-object needs to be created somewhere down the hierarchy. Furthermore, an index
-may be created in several different caches independently at different times.
-This is all handled transparently, and the netfs doesn't see any of it.
-
-A cookie will be created in the disabled state if enabled is false. A cookie
-must be enabled to do anything with it. A disabled cookie can be enabled by
-calling fscache_enable_cookie() (see below).
-
-For example, with AFS, a cell would be added to the primary index. This index
-entry would have a dependent inode containing volume mappings within this cell:
-
- cell->cache =
- fscache_acquire_cookie(afs_cache_netfs.primary_index,
- &afs_cell_cache_index_def,
- cell->name, strlen(cell->name),
- NULL, 0,
- cell, 0, true);
-
-And then a particular volume could be added to that index by ID, creating
-another index for vnodes (AFS inode equivalents):
-
- volume->cache =
- fscache_acquire_cookie(volume->cell->cache,
- &afs_volume_cache_index_def,
- &volume->vid, sizeof(volume->vid),
- NULL, 0,
- volume, 0, true);
-
-
-======================
-DATA FILE REGISTRATION
-======================
-
-The fourth step is to request a data file be created in the cache. This is
-identical to index cookie acquisition. The only difference is that the type in
-the object definition should be something other than index type.
-
- vnode->cache =
- fscache_acquire_cookie(volume->cache,
- &afs_vnode_cache_object_def,
- &key, sizeof(key),
- &aux, sizeof(aux),
- vnode, vnode->status.size, true);
-
-
-=================================
-MISCELLANEOUS OBJECT REGISTRATION
-=================================
-
-An optional step is to request an object of miscellaneous type be created in
-the cache. This is almost identical to index cookie acquisition. The only
-difference is that the type in the object definition should be something other
-than index type. While the parent object could be an index, it's more likely
-it would be some other type of object such as a data file.
-
- xattr->cache =
- fscache_acquire_cookie(vnode->cache,
- &afs_xattr_cache_object_def,
- &xattr->name, strlen(xattr->name),
- NULL, 0,
- xattr, strlen(xattr->val), true);
-
-Miscellaneous objects might be used to store extended attributes or directory
-entries for example.
-
-
-==========================
-SETTING THE DATA FILE SIZE
-==========================
-
-The fifth step is to set the physical attributes of the file, such as its size.
-This doesn't automatically reserve any space in the cache, but permits the
-cache to adjust its metadata for data tracking appropriately:
-
- int fscache_attr_changed(struct fscache_cookie *cookie);
-
-The cache will return -ENOBUFS if there is no backing cache or if there is no
-space to allocate any extra metadata required in the cache.
-
-Note that attempts to read or write data pages in the cache over this size may
-be rebuffed with -ENOBUFS.
-
-This operation schedules an attribute adjustment to happen asynchronously at
-some point in the future, and as such, it may happen after the function returns
-to the caller. The attribute adjustment excludes read and write operations.
-
-
-=====================
-PAGE ALLOC/READ/WRITE
-=====================
-
-And the sixth step is to store and retrieve pages in the cache. There are
-three functions that are used to do this.
-
-Note:
-
- (1) A page should not be re-read or re-allocated without uncaching it first.
-
- (2) A read or allocated page must be uncached when the netfs page is released
- from the pagecache.
-
- (3) A page should only be written to the cache if previously read or
- allocated.
-
-This permits the cache to maintain its page tracking in proper order.
-
-
-PAGE READ
----------
-
-Firstly, the netfs should ask FS-Cache to examine the caches and read the
-contents cached for a particular page of a particular file if present, or else
-allocate space to store the contents if not:
-
- typedef
- void (*fscache_rw_complete_t)(struct page *page,
- void *context,
- int error);
-
- int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
- struct page *page,
- fscache_rw_complete_t end_io_func,
- void *context,
- gfp_t gfp);
-
-The cookie argument must specify a cookie for an object that isn't an index,
-the page specified will have the data loaded into it (and is also used to
-specify the page number), and the gfp argument is used to control how any
-memory allocations made are satisfied.
-
-If the cookie indicates the inode is not cached:
-
- (1) The function will return -ENOBUFS.
-
-Else if there's a copy of the page resident in the cache:
-
- (1) The mark_pages_cached() cookie operation will be called on that page.
-
- (2) The function will submit a request to read the data from the cache's
- backing device directly into the page specified.
-
- (3) The function will return 0.
-
- (4) When the read is complete, end_io_func() will be invoked with:
-
- (*) The netfs data supplied when the cookie was created.
-
- (*) The page descriptor.
-
- (*) The context argument passed to the above function. This will be
- maintained with the get_context/put_context functions mentioned above.
-
- (*) An argument that's 0 on success or negative for an error code.
-
- If an error occurs, it should be assumed that the page contains no usable
- data. fscache_readpages_cancel() may need to be called.
-
- end_io_func() will be called in process context if the read results in an
- error, but it might be called in interrupt context if the read is
- successful.
-
-Otherwise, if there's not a copy available in cache, but the cache may be able
-to store the page:
-
- (1) The mark_pages_cached() cookie operation will be called on that page.
-
- (2) A block may be reserved in the cache and attached to the object at the
- appropriate place.
-
- (3) The function will return -ENODATA.
-
-This function may also return -ENOMEM or -EINTR, in which case it won't have
-read any data from the cache.
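-
-By way of illustration only, a readpage() routine might drive this as in the
-following sketch, where myfs_read_done() matches fscache_rw_complete_t and
-myfs_fetch_from_server() is a hypothetical fallback path:
-
-    ret = fscache_read_or_alloc_page(vnode->cache, page,
-                                     myfs_read_done, NULL, GFP_KERNEL);
-    switch (ret) {
-    case 0:         /* read dispatched; myfs_read_done() completes it */
-        return 0;
-    case -ENODATA:  /* block allocated: fetch, then write to the cache */
-    case -ENOBUFS:  /* not cached: just fetch from the server */
-        return myfs_fetch_from_server(vnode, page);
-    default:
-        return ret;
-    }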
-
-
-PAGE ALLOCATE
--------------
-
-Alternatively, if there's not expected to be any data in the cache for a page
-because the file has been extended, a block can simply be allocated instead:
-
- int fscache_alloc_page(struct fscache_cookie *cookie,
- struct page *page,
- gfp_t gfp);
-
-This is similar to the fscache_read_or_alloc_page() function, except that it
-never reads from the cache. It will return 0 if a block has been allocated,
-rather than -ENODATA as the other would. One or the other must be performed
-before writing to the cache.
-
-The mark_pages_cached() cookie operation will be called on the page if
-successful.
-
-
-PAGE WRITE
-----------
-
-Secondly, if the netfs changes the contents of the page (either due to an
-initial download or if a user performs a write), then the page should be
-written back to the cache:
-
- int fscache_write_page(struct fscache_cookie *cookie,
- struct page *page,
- loff_t object_size,
- gfp_t gfp);
-
-The cookie argument must specify a data file cookie, the page specified should
-contain the data to be written (and is also used to specify the page number),
-object_size is the revised size of the object and the gfp argument is used to
-control how any memory allocations made are satisfied.
-
-The page must have first been read or allocated successfully and must not have
-been uncached before writing is performed.
-
-If the cookie indicates the inode is not cached then:
-
- (1) The function will return -ENOBUFS.
-
-Else if space can be allocated in the cache to hold this page:
-
- (1) PG_fscache_write will be set on the page.
-
- (2) The function will submit a request to write the data to cache's backing
- device directly from the page specified.
-
- (3) The function will return 0.
-
- (4) When the write is complete PG_fscache_write is cleared on the page and
- anyone waiting for that bit will be woken up.
-
-Else if there's no space available in the cache, -ENOBUFS will be returned. It
-is also possible for the PG_fscache_write bit to be cleared when no write took
-place if unforeseen circumstances arose (such as a disk error).
-
-Writing takes place asynchronously.
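-
-For illustration, after filling a page from the server a netfs might then copy
-it into the cache like this (the vnode layout is hypothetical and error
-handling is abbreviated):
-
-    ret = fscache_write_page(vnode->cache, page, vnode->status.size,
-                             GFP_KERNEL);
-    if (ret != 0) {
-        /* The write couldn't be started, so undo the page marking. */
-        fscache_uncache_page(vnode->cache, page);
-        BUG_ON(PageFsCache(page));
-    }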
-
-
-MULTIPLE PAGE READ
-------------------
-
-A facility is provided to read several pages at once, as requested by the
-readpages() address space operation:
-
- int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
- struct address_space *mapping,
- struct list_head *pages,
- int *nr_pages,
- fscache_rw_complete_t end_io_func,
- void *context,
- gfp_t gfp);
-
-This works in a similar way to fscache_read_or_alloc_page(), except:
-
- (1) Any page it can retrieve data for is removed from pages and nr_pages and
- dispatched for reading to the disk. Reads of adjacent pages on disk may
- be merged for greater efficiency.
-
- (2) The mark_pages_cached() cookie operation will be called on several pages
- at once if they're being read or allocated.
-
- (3) If there was a general error, then that error will be returned.
-
- Else if some pages couldn't be allocated or read, then -ENOBUFS will be
- returned.
-
- Else if some pages couldn't be read but were allocated, then -ENODATA will
- be returned.
-
- Otherwise, if all pages had reads dispatched, then 0 will be returned, the
- list will be empty and *nr_pages will be 0.
-
- (4) end_io_func will be called once for each page being read as the reads
- complete. It will be called in process context if error != 0, but it may
- be called in interrupt context if there is no error.
-
-Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
-some of the pages being read and some being allocated. Those pages will have
-been marked appropriately and will need uncaching.
-
-
-CANCELLATION OF UNREAD PAGES
-----------------------------
-
-If one or more pages are passed to fscache_read_or_alloc_pages() but not then
-read from the cache and also not read from the underlying filesystem then
-those pages will need to have any marks and reservations removed. This can be
-done by calling:
-
- void fscache_readpages_cancel(struct fscache_cookie *cookie,
- struct list_head *pages);
-
-prior to returning to the caller. The cookie argument should be as passed to
-fscache_read_or_alloc_pages(). Every page in the pages list will be examined
-and any that have PG_fscache set will be uncached.
-
-
-==============
-PAGE UNCACHING
-==============
-
-To uncache a page, this function should be called:
-
- void fscache_uncache_page(struct fscache_cookie *cookie,
- struct page *page);
-
-This function permits the cache to release any in-memory representation it
-might be holding for this netfs page. This function must be called once for
-each page on which the read or write page functions above have been called to
-make sure the cache's in-memory tracking information gets torn down.
-
-Note that pages can't be explicitly deleted from a data file. The whole
-data file must be retired (see the relinquish cookie function below).
-
-Furthermore, note that this does not cancel the asynchronous read or write
-operation started by the read/alloc and write functions, so the page
-invalidation functions must use:
-
- bool fscache_check_page_write(struct fscache_cookie *cookie,
- struct page *page);
-
-to see if a page is being written to the cache, and:
-
- void fscache_wait_on_page_write(struct fscache_cookie *cookie,
- struct page *page);
-
-to wait for it to finish if it is.
-
-
-When releasepage() is being implemented, a special FS-Cache function exists to
-manage the heuristics of coping with vmscan trying to eject pages, which may
-conflict with the cache trying to write pages to the cache (which may itself
-need to allocate memory):
-
- bool fscache_maybe_release_page(struct fscache_cookie *cookie,
- struct page *page,
- gfp_t gfp);
-
-This takes the netfs cookie, and the page and gfp arguments as supplied to
-releasepage(). It will return false if the page cannot be released yet for
-some reason and if it returns true, the page has been uncached and can now be
-released.
-
-To make a page available for release, this function may wait for an outstanding
-storage request to complete, or it may attempt to cancel the storage request -
-in which case the page will not be stored in the cache this time.
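-
-For illustration, a releasepage() implementation might then be no more than
-the following (the myfs names are hypothetical):
-
-    static int myfs_releasepage(struct page *page, gfp_t gfp)
-    {
-        struct myfs_vnode *vnode = MYFS_I(page->mapping->host);
-
-        return fscache_maybe_release_page(vnode->cache, page, gfp);
-    }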
-
-
-BULK INODE PAGE UNCACHE
------------------------
-
-A convenience routine is provided to perform an uncache on all the pages
-attached to an inode. This assumes that the pages on the inode correspond on a
-1:1 basis with the pages in the cache.
-
- void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
- struct inode *inode);
-
-This takes the netfs cookie that the pages were cached with and the inode that
-the pages are attached to. This function will wait for pages to finish being
-written to the cache and for the cache to finish with the page generally. No
-error is returned.
-
-
-===============================
-INDEX AND DATA FILE CONSISTENCY
-===============================
-
-To find out whether auxiliary data for an object is up to date within the
-cache, the following function can be called:
-
- int fscache_check_consistency(struct fscache_cookie *cookie,
- const void *aux_data);
-
-This will call back to the netfs to check whether the auxiliary data associated
-with a cookie is correct; if aux_data is non-NULL, it will update the auxiliary
-data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also
-return -ENOMEM and -ERESTARTSYS.
-
-To request an update of the index data for an index or other object, the
-following function should be called:
-
- void fscache_update_cookie(struct fscache_cookie *cookie,
- const void *aux_data);
-
-This function will update the cookie's auxiliary data buffer from aux_data if
-that is non-NULL and then schedule this to be stored on disk. The update
-method in the parent index definition will be called to transfer the data.
-
-Note that partial updates may happen automatically at other times, such as when
-data blocks are added to a data file object.
-
-
-=================
-COOKIE ENABLEMENT
-=================
-
-Cookies exist in one of two states: enabled and disabled. If a cookie is
-disabled, it ignores all attempts to acquire child cookies; check, update or
-invalidate its state; allocate, read or write backing pages - though it is
-still possible to uncache pages and relinquish the cookie.
-
-The initial enablement state is set by fscache_acquire_cookie(), but the cookie
-can be enabled or disabled later. To disable a cookie, call:
-
- void fscache_disable_cookie(struct fscache_cookie *cookie,
- const void *aux_data,
- bool invalidate);
-
-If the cookie is not already disabled, this locks the cookie against other
-enable and disable ops, marks the cookie as being disabled, discards or
-invalidates any backing objects and waits for cessation of activity on any
-associated object before unlocking the cookie.
-
-All possible failures are handled internally. The caller should consider
-calling fscache_uncache_all_inode_pages() afterwards to make sure all page
-markings are cleared up.
-
-Cookies can be enabled or reenabled with:
-
- void fscache_enable_cookie(struct fscache_cookie *cookie,
- const void *aux_data,
- loff_t object_size,
- bool (*can_enable)(void *data),
- void *data)
-
-If the cookie is not already enabled, this locks the cookie against other
-enable and disable ops, invokes can_enable() and, if the cookie is not an index
-cookie, will begin the procedure of acquiring backing objects.
-
-The optional can_enable() function is passed the data argument and returns a
-ruling as to whether or not enablement should actually be permitted to begin.
-
-All possible failures are handled internally. The cookie will only be marked
-as enabled if provisional backing objects are allocated.
-
-The object's data size is updated from object_size and is passed to the
-->check_aux() function.
-
-In both cases, the cookie's auxiliary data buffer is updated from aux_data if
-that is non-NULL inside the enablement lock before proceeding.
-
-
-===============================
-MISCELLANEOUS COOKIE OPERATIONS
-===============================
-
-There are a number of operations that can be used to control cookies:
-
- (*) Cookie pinning:
-
- int fscache_pin_cookie(struct fscache_cookie *cookie);
- void fscache_unpin_cookie(struct fscache_cookie *cookie);
-
- These operations permit data cookies to be pinned into the cache and to
- have the pinning removed. They are not permitted on index cookies.
-
- The pinning function will return 0 if successful, -ENOBUFS if the cookie
- isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
- -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
- -EIO if there's any other problem.
-
- (*) Data space reservation:
-
- int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
-
- This permits a netfs to request cache space be reserved to store up to the
- given amount of a file. It is permitted to ask for more than the current
- size of the file to allow for future file expansion.
-
- If size is given as zero then the reservation will be cancelled.
-
- The function will return 0 if successful, -ENOBUFS if the cookie isn't
- backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
- -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
- -EIO if there's any other problem.
-
- Note that this doesn't pin an object in a cache; it can still be culled to
- make space if it's not in use.
-
-
-=====================
-COOKIE UNREGISTRATION
-=====================
-
-To get rid of a cookie, this function should be called.
-
- void fscache_relinquish_cookie(struct fscache_cookie *cookie,
- const void *aux_data,
- bool retire);
-
-If retire is non-zero, then the object will be marked for recycling, and all
-copies of it will be removed from all active caches in which it is present.
-Not only that but all child objects will also be retired.
-
-If retire is zero, then the object may be available again when next the
-acquisition function is called. Retirement here will overrule the pinning on a
-cookie.
-
-The cookie's auxiliary data will be updated from aux_data if that is non-NULL
-so that the cache can lazily update it on disk.
-
-One very important note - relinquish must NOT be called for a cookie unless all
-the cookies for "child" indices, objects and pages have been relinquished
-first.
-
-
-==================
-INDEX INVALIDATION
-==================
-
-There is no direct way to invalidate an index subtree. To do this, the caller
-should relinquish and retire the cookie they have, and then acquire a new one.
-
-
-======================
-DATA FILE INVALIDATION
-======================
-
-Sometimes it will be necessary to invalidate an object that contains data.
-Typically this will be necessary when the server tells the netfs of a foreign
-change - at which point the netfs has to throw away all the state it had for an
-inode and reload from the server.
-
-To indicate that a cache object should be invalidated, the following function
-can be called:
-
- void fscache_invalidate(struct fscache_cookie *cookie);
-
-This can be called with spinlocks held as it defers the work to a thread pool.
-All extant storage, retrieval and attribute change ops at this point are
-cancelled and discarded. Some future operations will be rejected until the
-cache has had a chance to insert a barrier in the operations queue. After
-that, operations will be queued again behind the invalidation operation.
-
-The invalidation operation will perform an attribute change operation and an
-auxiliary data update operation as it is very likely these will have changed.
-
-Using the following function, the netfs can wait for the invalidation operation
-to have reached a point at which it can start submitting ordinary operations
-once again:
-
- void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
-
-
-===========================
-FS-CACHE SPECIFIC PAGE FLAG
-===========================
-
-FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
-given the alternative name PG_fscache.
-
-PG_fscache is used to indicate that the page is known by the cache, and that
-the cache must be informed if the page is going to go away. It's an indication
-to the netfs that the cache has an interest in this page, where an interest may
-be a pointer to it, resources allocated or reserved for it, or I/O in progress
-upon it.
-
-The netfs can use this information in methods such as releasepage() to
-determine whether it needs to uncache a page or update it.
-
-Furthermore, if this bit is set, releasepage() and invalidatepage() operations
-will be called on a page to get rid of it, even if PG_private is not set. This
-allows caching to be attempted on a page before read_cache_pages() is called
-after fscache_read_or_alloc_pages(), as the former will try to release pages
-it was given under certain circumstances.
-
-This bit does not overlap with other flags such as PG_private. This means that
-FS-Cache can be used with a filesystem that uses the block buffering code.
-
-There are a number of operations defined on this flag:
-
- int PageFsCache(struct page *page);
- void SetPageFsCache(struct page *page)
- void ClearPageFsCache(struct page *page)
- int TestSetPageFsCache(struct page *page)
- int TestClearPageFsCache(struct page *page)
-
-These functions are bit test, bit set, bit clear, bit test and set and bit
-test and clear operations on PG_fscache.
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt
deleted file mode 100644
index 100ff41127e4..000000000000
--- a/Documentation/filesystems/caching/object.txt
+++ /dev/null
@@ -1,320 +0,0 @@
- ====================================================
- IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
- ====================================================
-
-By: David Howells <dhowells@redhat.com>
-
-Contents:
-
- (*) Representation
-
- (*) Object management state machine.
-
- - Provision of cpu time.
- - Locking simplification.
-
- (*) The set of states.
-
- (*) The set of events.
-
-
-==============
-REPRESENTATION
-==============
-
-FS-Cache maintains an in-kernel representation of each object that a netfs is
-currently interested in. Such objects are represented by the fscache_cookie
-struct and are referred to as cookies.
-
-FS-Cache also maintains a separate in-kernel representation of the objects that
-a cache backend is currently actively caching. Such objects are represented by
-the fscache_object struct. The cache backends allocate these upon request, and
-are expected to embed them in their own representations. These are referred to
-as objects.
-
-There is a 1:N relationship between cookies and objects. A cookie may be
-represented by multiple objects - an index may exist in more than one cache -
-or even by no objects (it may not be cached).
-
-Furthermore, both cookies and objects are hierarchical. The two hierarchies
-correspond, but the cookies tree is a superset of the union of the object trees
-of multiple caches:
-
- NETFS INDEX TREE : CACHE 1 : CACHE 2
- : :
- : +-----------+ :
- +----------->| IObject | :
- +-----------+ | : +-----------+ :
- | ICookie |-------+ : | :
- +-----------+ | : | : +-----------+
- | +------------------------------>| IObject |
- | : | : +-----------+
- | : V : |
- | : +-----------+ : |
- V +----------->| IObject | : |
- +-----------+ | : +-----------+ : |
- | ICookie |-------+ : | : V
- +-----------+ | : | : +-----------+
- | +------------------------------>| IObject |
- +-----+-----+ : | : +-----------+
- | | : | : |
- V | : V : |
- +-----------+ | : +-----------+ : |
- | ICookie |------------------------->| IObject | : |
- +-----------+ | : +-----------+ : |
- | V : | : V
- | +-----------+ : | : +-----------+
- | | ICookie |-------------------------------->| IObject |
- | +-----------+ : | : +-----------+
- V | : V : |
- +-----------+ | : +-----------+ : |
- | DCookie |------------------------->| DObject | : |
- +-----------+ | : +-----------+ : |
- | : : |
- +-------+-------+ : : |
- | | : : |
- V V : : V
- +-----------+ +-----------+ : : +-----------+
- | DCookie | | DCookie |------------------------>| DObject |
- +-----------+ +-----------+ : : +-----------+
- : :
-
-In the above illustration, ICookie and IObject represent indices and DCookie
-and DObject represent data storage objects. Indices may have representation in
-multiple caches, but currently, non-index objects may not. Objects of any type
-may also be entirely unrepresented.
-
-As far as the netfs API goes, the netfs is only actually permitted to see
-pointers to the cookies. The cookies themselves and any objects attached to
-those cookies are hidden from it.
-
-
-===============================
-OBJECT MANAGEMENT STATE MACHINE
-===============================
-
-Within FS-Cache, each active object is managed by its own individual state
-machine. The state for an object is kept in the fscache_object struct, in
-object->state. A cookie may point to a set of objects that are in different
-states.
-
-Each state has an action associated with it that is invoked when the machine
-wakes up in that state. There are four logical sets of states:
-
- (1) Preparation: states that wait for the parent objects to become ready. The
- representations are hierarchical, and it is expected that an object must
- be created or accessed with respect to its parent object.
-
- (2) Initialisation: states that perform lookups in the cache and validate
- what's found and that create on disk any missing metadata.
-
- (3) Normal running: states that allow netfs operations on objects to proceed
- and that update the state of objects.
-
- (4) Termination: states that detach objects from their netfs cookies, that
- delete objects from disk, that handle disk and system errors and that free
- up in-memory resources.
-
-
-In most cases, transitioning between states is in response to signalled events.
-When a state has finished processing, it will usually set the mask of events in
-which it is interested (object->event_mask) and relinquish the worker thread.
-Then when an event is raised (by calling fscache_raise_event()), if the event
-is not masked, the object will be queued for processing (by calling
-fscache_enqueue_object()).
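-
-As a minimal sketch (the handler shape here is illustrative; only the helper
-names are taken from the description above), a state handler might finish by
-expressing interest solely in error and withdrawal events:
-
-	/* Handler finishes: select the events of interest and yield. */
-	object->event_mask = (1 << FSCACHE_OBJECT_EV_ERROR) |
-			     (1 << FSCACHE_OBJECT_EV_WITHDRAW);
-
-	/* Elsewhere, raising an unmasked event queues the object for
-	 * processing via fscache_enqueue_object(): */
-	fscache_raise_event(object, FSCACHE_OBJECT_EV_ERROR);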
-
-
-PROVISION OF CPU TIME
----------------------
-
-The work to be done by the various states was given CPU time by the threads of
-the slow work facility. This was used in preference to the workqueue facility
-because:
-
- (1) Threads may be completely occupied for very long periods of time by a
- particular work item. These state actions may be doing sequences of
- synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
- getxattr, truncate, unlink, rmdir, rename).
-
- (2) Threads may do little actual work, but may rather spend a lot of time
- sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
- workqueues don't necessarily have the right numbers of threads.
-
-
-LOCKING SIMPLIFICATION
-----------------------
-
-Because only one worker thread may be operating on any particular object's
-state machine at once, this simplifies the locking, particularly with respect
-to disconnecting the netfs's representation of a cache object (fscache_cookie)
-from the cache backend's representation (fscache_object) - which may be
-requested from either end.
-
-
-=================
-THE SET OF STATES
-=================
-
-The object state machine has a set of states that it can be in. There are
-preparation states in which the object sets itself up and waits for its parent
-object to transit to a state that allows access to its children:
-
- (1) State FSCACHE_OBJECT_INIT.
-
- Initialise the object and wait for the parent object to become active. In
- the cache, it is expected that it will not be possible to look an object
- up from the parent object, until that parent object itself has been looked
- up.
-
-There are initialisation states in which the object sets itself up and accesses
-disk for the object metadata:
-
- (2) State FSCACHE_OBJECT_LOOKING_UP.
-
- Look up the object on disk, using the parent as a starting point.
- FS-Cache expects the cache backend to probe the cache to see whether this
- object is represented there, and if it is, to see if it's valid (coherency
- management).
-
- The cache should call fscache_object_lookup_negative() to indicate lookup
- failure for whatever reason, and should call fscache_obtained_object() to
- indicate success.
-
- At the completion of lookup, FS-Cache will let the netfs go ahead with
- read operations, no matter whether the file is yet cached. If not yet
- cached, read operations will be immediately rejected with ENODATA until
-     the first known page is uncached - as up to that point there can be no data
- to be read out of the cache for that file that isn't currently also held
- in the pagecache.
-
- (3) State FSCACHE_OBJECT_CREATING.
-
- Create an object on disk, using the parent as a starting point. This
- happens if the lookup failed to find the object, or if the object's
-     coherency data indicated that what's on disk is out of date.  In this
-     state, FS-Cache expects the cache to create the object on disk.
-
- The cache should call fscache_obtained_object() if creation completes
- successfully, fscache_object_lookup_negative() otherwise.
-
- At the completion of creation, FS-Cache will start processing write
- operations the netfs has queued for an object. If creation failed, the
- write ops will be transparently discarded, and nothing recorded in the
- cache.
-
-There are some normal running states in which the object spends its time
-servicing netfs requests:
-
- (4) State FSCACHE_OBJECT_AVAILABLE.
-
- A transient state in which pending operations are started, child objects
- are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
- lookup data is freed.
-
- (5) State FSCACHE_OBJECT_ACTIVE.
-
- The normal running state. In this state, requests the netfs makes will be
- passed on to the cache.
-
- (6) State FSCACHE_OBJECT_INVALIDATING.
-
-     The object is undergoing invalidation.  When the object enters this
-     state, it discards all pending read, write and attribute change
-     operations, as it is going to clear out the cache entirely and
-     reinitialise it.  It will then continue to the FSCACHE_OBJECT_UPDATING
-     state.
-
- (7) State FSCACHE_OBJECT_UPDATING.
-
- The state machine comes here to update the object in the cache from the
- netfs's records. This involves updating the auxiliary data that is used
- to maintain coherency.
-
-And there are terminal states in which an object cleans itself up, deallocates
-memory and potentially deletes stuff from disk:
-
- (8) State FSCACHE_OBJECT_LC_DYING.
-
- The object comes here if it is dying because of a lookup or creation
- error. This would be due to a disk error or system error of some sort.
- Temporary data is cleaned up, and the parent is released.
-
- (9) State FSCACHE_OBJECT_DYING.
-
- The object comes here if it is dying due to an error, because its parent
- cookie has been relinquished by the netfs or because the cache is being
- withdrawn.
-
- Any child objects waiting on this one are given CPU time so that they too
- can destroy themselves. This object waits for all its children to go away
- before advancing to the next state.
-
-(10) State FSCACHE_OBJECT_ABORT_INIT.
-
- The object comes to this state if it was waiting on its parent in
- FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
- so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
-
-(11) State FSCACHE_OBJECT_RELEASING.
-(12) State FSCACHE_OBJECT_RECYCLING.
-
- The object comes to one of these two states when dying once it is rid of
- all its children, if it is dying because the netfs relinquished its
- cookie. In the first state, the cached data is expected to persist, and
- in the second it will be deleted.
-
-(13) State FSCACHE_OBJECT_WITHDRAWING.
-
- The object transits to this state if the cache decides it wants to
- withdraw the object from service, perhaps to make space, but also due to
- error or just because the whole cache is being withdrawn.
-
-(14) State FSCACHE_OBJECT_DEAD.
-
- The object transits to this state when the in-memory object record is
- ready to be deleted. The object processor shouldn't ever see an object in
- this state.
-
-
-THE SET OF EVENTS
------------------
-
-There are a number of events that can be raised to an object state machine:
-
- (*) FSCACHE_OBJECT_EV_UPDATE
-
- The netfs requested that an object be updated. The state machine will ask
- the cache backend to update the object, and the cache backend will ask the
- netfs for details of the change through its cookie definition ops.
-
- (*) FSCACHE_OBJECT_EV_CLEARED
-
- This is signalled in two circumstances:
-
- (a) when an object's last child object is dropped and
-
- (b) when the last operation outstanding on an object is completed.
-
- This is used to proceed from the dying state.
-
- (*) FSCACHE_OBJECT_EV_ERROR
-
- This is signalled when an I/O error occurs during the processing of some
- object.
-
- (*) FSCACHE_OBJECT_EV_RELEASE
- (*) FSCACHE_OBJECT_EV_RETIRE
-
- These are signalled when the netfs relinquishes a cookie it was using.
- The event selected depends on whether the netfs asks for the backing
- object to be retired (deleted) or retained.
-
- (*) FSCACHE_OBJECT_EV_WITHDRAW
-
- This is signalled when the cache backend wants to withdraw an object.
- This means that the object will have to be detached from the netfs's
- cookie.
-
-Because the withdrawing, releasing and retiring events are all handled by the
-object state machine, it doesn't matter if there's a collision with both ends
-trying to sever the connection at the same time.  The state machine can just
-pick which one it wants to honour, and honouring that one also deals with the
-other.
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
deleted file mode 100644
index d8976c434718..000000000000
--- a/Documentation/filesystems/caching/operations.txt
+++ /dev/null
@@ -1,213 +0,0 @@
- ================================
- ASYNCHRONOUS OPERATIONS HANDLING
- ================================
-
-By: David Howells <dhowells@redhat.com>
-
-Contents:
-
- (*) Overview.
-
- (*) Operation record initialisation.
-
- (*) Parameters.
-
- (*) Procedure.
-
- (*) Asynchronous callback.
-
-
-========
-OVERVIEW
-========
-
-FS-Cache has an asynchronous operations handling facility that it uses for its
-data storage and retrieval routines. Its operations are represented by
-fscache_operation structs, though these are usually embedded into some other
-structure.
-
-This facility is available to and expected to be used by the cache backends,
-and FS-Cache will create operations and pass them off to the appropriate cache
-backend for completion.
-
-To make use of this facility, <linux/fscache-cache.h> should be #included.
-
-
-===============================
-OPERATION RECORD INITIALISATION
-===============================
-
-An operation is recorded in an fscache_operation struct:
-
- struct fscache_operation {
- union {
- struct work_struct fast_work;
- struct slow_work slow_work;
- };
- unsigned long flags;
- fscache_operation_processor_t processor;
- ...
- };
-
-Someone wanting to issue an operation should allocate something with this
-struct embedded in it. They should initialise it by calling:
-
- void fscache_operation_init(struct fscache_operation *op,
- fscache_operation_release_t release);
-
-with the operation to be initialised and the release function to use.
-
-The op->flags parameter should be set to indicate the CPU time provision and
-the exclusivity (see the Parameters section).
-
-The op->fast_work, op->slow_work and op->processor fields should be set as
-appropriate for the CPU time provision (see the Parameters section).
-
-FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
-operation and waited for afterwards.
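-
-As a minimal sketch (the container type and release function here are
-hypothetical; only fscache_operation_init() comes from the above), a caller
-might embed and initialise an operation like this:
-
-	struct my_op {
-		struct fscache_operation op;	/* must be embedded */
-		/* ... caller-specific state ... */
-	};
-
-	static void my_op_release(struct fscache_operation *_op)
-	{
-		kfree(container_of(_op, struct my_op, op));
-	}
-
-	struct my_op *op = kzalloc(sizeof(*op), GFP_KERNEL);
-
-	fscache_operation_init(&op->op, my_op_release);
-	op->op.flags |= FSCACHE_OP_MYTHREAD;	/* CPU time provision */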
-
-
-==========
-PARAMETERS
-==========
-
-There are a number of parameters that can be set in the operation record's
-flags field.  There are three options for the provision of CPU time in these
-operations:
-
- (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
- may decide it wants to handle an operation itself without deferring it to
- another thread.
-
- This is, for example, used in read operations for calling readpages() on
- the backing filesystem in CacheFiles. Although readpages() does an
- asynchronous data fetch, the determination of whether pages exist is done
- synchronously - and the netfs does not proceed until this has been
- determined.
-
- If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
- before submitting the operation, and the operating thread must wait for it
- to be cleared before proceeding:
-
- wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
- TASK_UNINTERRUPTIBLE);
-
-
- (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
- will be given to keventd to process. Such an operation is not permitted
- to sleep on I/O.
-
- This is, for example, used by CacheFiles to copy data from a backing fs
- page to a netfs page after the backing fs has read the page in.
-
- If this option is used, op->fast_work and op->processor must be
- initialised before submitting the operation:
-
- INIT_WORK(&op->fast_work, do_some_work);
-
-
- (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
- will be given to the slow work facility to process. Such an operation is
- permitted to sleep on I/O.
-
- This is, for example, used by FS-Cache to handle background writes of
- pages that have just been fetched from a remote server.
-
- If this option is used, op->slow_work and op->processor must be
- initialised before submitting the operation:
-
- fscache_operation_init_slow(op, processor)
-
-
-Furthermore, operations may be one of two types:
-
- (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
- conjunction with any other operation on the object being operated upon.
-
- An example of this is the attribute change operation, in which the file
- being written to may need truncation.
-
- (2) Shareable. Operations of this type may be running simultaneously. It's
- up to the operation implementation to prevent interference between other
- operations running at the same time.
-
-
-=========
-PROCEDURE
-=========
-
-Operations are used through the following procedure:
-
- (1) The submitting thread must allocate the operation and initialise it
- itself. Normally this would be part of a more specific structure with the
- generic op embedded within.
-
- (2) The submitting thread must then submit the operation for processing using
- one of the following two functions:
-
- int fscache_submit_op(struct fscache_object *object,
- struct fscache_operation *op);
-
- int fscache_submit_exclusive_op(struct fscache_object *object,
- struct fscache_operation *op);
-
- The first function should be used to submit non-exclusive ops and the
- second to submit exclusive ones. The caller must still set the
- FSCACHE_OP_EXCLUSIVE flag.
-
- If successful, both functions will assign the operation to the specified
- object and return 0. -ENOBUFS will be returned if the object specified is
- permanently unavailable.
-
- The operation manager will defer operations on an object that is still
- undergoing lookup or creation. The operation will also be deferred if an
- operation of conflicting exclusivity is in progress on the object.
-
- If the operation is asynchronous, the manager will retain a reference to
- it, so the caller should put their reference to it by passing it to:
-
- void fscache_put_operation(struct fscache_operation *op);
-
- (3) If the submitting thread wants to do the work itself, and has marked the
- operation with FSCACHE_OP_MYTHREAD, then it should monitor
- FSCACHE_OP_WAITING as described above and check the state of the object if
- necessary (the object might have died while the thread was waiting).
-
- When it has finished doing its processing, it should call
- fscache_op_complete() and fscache_put_operation() on it.
-
- (4) The operation holds an effective lock upon the object, preventing other
- exclusive ops conflicting until it is released. The operation can be
- enqueued for further immediate asynchronous processing by adjusting the
- CPU time provisioning option if necessary, eg:
-
- op->flags &= ~FSCACHE_OP_TYPE;
-	op->flags |= FSCACHE_OP_FAST;
-
- and calling:
-
- void fscache_enqueue_operation(struct fscache_operation *op)
-
- This can be used to allow other things to have use of the worker thread
- pools.
-
-
-=====================
-ASYNCHRONOUS CALLBACK
-=====================
-
-When used in asynchronous mode, the worker thread pool will invoke the
-processor method with a pointer to the operation. This should then get at the
-container struct by using container_of():
-
- static void fscache_write_op(struct fscache_operation *_op)
- {
- struct fscache_storage *op =
- container_of(_op, struct fscache_storage, op);
- ...
- }
-
-The caller holds a reference on the operation, and will invoke
-fscache_put_operation() when the processor function returns. The processor
-function is at liberty to call fscache_enqueue_operation() or to take extra
-references.
diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst
index 0aa70750df0f..085f309ece60 100644
--- a/Documentation/filesystems/ceph.rst
+++ b/Documentation/filesystems/ceph.rst
@@ -57,6 +57,16 @@ a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
+Snapshot names have two limitations:
+
+* They cannot start with an underscore ('_'), as these names are reserved
+  for internal usage by the MDS.
+* They cannot exceed 240 characters.  This is because the MDS makes use of
+  long snapshot names internally, which follow the format:
+  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`.  Since filenames in general can't have
+  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
+  long snapshot names can have at most 255 - 1 - 1 - 13 = 240 characters.
+
Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
@@ -82,7 +92,7 @@ Mount Syntax
The basic mount syntax is::
- # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
+ # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]
You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
@@ -90,16 +100,35 @@ happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::
- # mount -t ceph 1.2.3.4:/ /mnt/ceph
+ # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4
is sufficient. If /sbin/mount.ceph is installed, a hostname can be
-used instead of an IP address.
+used instead of an IP address and the cluster FSID can be left out
+(as the mount helper will fill it in by reading the ceph configuration
+file)::
+
+ # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr
+
+Multiple monitor addresses can be passed by separating each address with a
+slash (`/`)::
+
+ # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101
+
+When using the mount helper, the monitor address can be read from the ceph
+configuration file if available.  Note that the cluster FSID (passed as part
+of the device string) is validated by checking it against the FSID reported
+by the monitor.
Mount Options
=============
+ mon_addr=ip_address[:port][/ip_address[:port]]
+    Monitor address of the cluster.  This is used to bootstrap the
+    connection to the cluster.  Once the connection is established, the
+    monitor addresses in the monitor map are followed.
+
+ fsid=cluster-id
+ FSID of the cluster (from `ceph fsid` command).
+
ip=A.B.C.D[:N]
Specify the IP and/or port the client should bind to locally.
There is normally not much reason to do this. If the IP is not
@@ -163,14 +192,14 @@ Mount Options
to the default VFS implementation if this option is used.
recover_session=<no|clean>
- Set auto reconnect mode in the case where the client is blacklisted. The
+ Set auto reconnect mode in the case where the client is blocklisted. The
available modes are "no" and "clean". The default is "no".
* no: never attempt to reconnect when client detects that it has been
- blacklisted. Operations will generally fail after being blacklisted.
+ blocklisted. Operations will generally fail after being blocklisted.
* clean: client reconnects to the ceph cluster automatically when it
- detects that it has been blacklisted. During reconnect, client drops
+ detects that it has been blocklisted. During reconnect, client drops
dirty data/metadata, invalidates page caches and writable file handles.
After reconnect, file locks become stale because the MDS loses track
of them. If an inode contains any stale file locks, read/write on the
@@ -184,7 +213,6 @@ For more information on Ceph, see the home page at
The Linux kernel client source tree is available at
- https://github.com/ceph/ceph-client.git
- - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
https://github.com/ceph/ceph.git
diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst
new file mode 100644
index 000000000000..bdde7e4e010b
--- /dev/null
+++ b/Documentation/filesystems/coda.rst
@@ -0,0 +1,1670 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Coda Kernel-Venus Interface
+===========================
+
+.. Note::
+
+ This is one of the technical documents describing a component of
+ Coda -- this document describes the client kernel-Venus interface.
+
+For more information:
+
+ http://www.coda.cs.cmu.edu
+
+For user level software needed to run Coda:
+
+ ftp://ftp.coda.cs.cmu.edu
+
+To run Coda you need to get a user level cache manager for the client,
+named Venus, as well as tools to manipulate ACLs, to log in, etc. The
+client needs to have the Coda filesystem selected in the kernel
+configuration.
+
+The server needs a user level server and at present does not depend on
+kernel support.
+
+ The Venus kernel interface
+
+ Peter J. Braam
+
+ v1.0, Nov 9, 1997
+
+  This document describes the communication between Venus and kernel
+  level filesystem code needed for the operation of the Coda file
+  system.  This document version is meant to describe the current
+  interface (version 1.0) as well as improvements we envisage.
+
+.. Table of Contents
+
+ 1. Introduction
+
+ 2. Servicing Coda filesystem calls
+
+ 3. The message layer
+
+ 3.1 Implementation details
+
+ 4. The interface at the call level
+
+ 4.1 Data structures shared by the kernel and Venus
+ 4.2 The pioctl interface
+ 4.3 root
+ 4.4 lookup
+ 4.5 getattr
+ 4.6 setattr
+ 4.7 access
+ 4.8 create
+ 4.9 mkdir
+ 4.10 link
+ 4.11 symlink
+ 4.12 remove
+ 4.13 rmdir
+ 4.14 readlink
+ 4.15 open
+ 4.16 close
+ 4.17 ioctl
+ 4.18 rename
+ 4.19 readdir
+ 4.20 vget
+ 4.21 fsync
+ 4.22 inactive
+ 4.23 rdwr
+ 4.24 odymount
+ 4.25 ody_lookup
+ 4.26 ody_expand
+ 4.27 prefetch
+ 4.28 signal
+
+ 5. The minicache and downcalls
+
+ 5.1 INVALIDATE
+ 5.2 FLUSH
+ 5.3 PURGEUSER
+ 5.4 ZAPFILE
+ 5.5 ZAPDIR
+ 5.6 ZAPVNODE
+ 5.7 PURGEFID
+ 5.8 REPLACE
+
+ 6. Initialization and cleanup
+
+ 6.1 Requirements
+
+1. Introduction
+===============
+
+ A key component in the Coda Distributed File System is the cache
+ manager, Venus.
+
+ When processes on a Coda enabled system access files in the Coda
+ filesystem, requests are directed at the filesystem layer in the
+ operating system. The operating system will communicate with Venus to
+ service the request for the process. Venus manages a persistent
+ client cache and makes remote procedure calls to Coda file servers and
+ related servers (such as authentication servers) to service these
+ requests it receives from the operating system. When Venus has
+ serviced a request it replies to the operating system with appropriate
+ return codes, and other data related to the request. Optionally the
+ kernel support for Coda may maintain a minicache of recently processed
+ requests to limit the number of interactions with Venus. Venus
+ possesses the facility to inform the kernel when elements from its
+ minicache are no longer valid.
+
+ This document describes precisely this communication between the
+ kernel and Venus. The definitions of so called upcalls and downcalls
+ will be given with the format of the data they handle. We shall also
+ describe the semantic invariants resulting from the calls.
+
+ Historically Coda was implemented in a BSD file system in Mach 2.6.
+ The interface between the kernel and Venus is very similar to the BSD
+ VFS interface. Similar functionality is provided, and the format of
+ the parameters and returned data is very similar to the BSD VFS. This
+ leads to an almost natural environment for implementing a kernel-level
+ filesystem driver for Coda in a BSD system. However, other operating
+  systems such as Linux, Windows 95 and NT have virtual filesystems
+  with different interfaces.
+
+ To implement Coda on these systems some reverse engineering of the
+ Venus/Kernel protocol is necessary. Also it came to light that other
+ systems could profit significantly from certain small optimizations
+ and modifications to the protocol. To facilitate this work as well as
+ to make future ports easier, communication between Venus and the
+ kernel should be documented in great detail. This is the aim of this
+ document.
+
+2. Servicing Coda filesystem calls
+===================================
+
+  A request for Coda file system service originates in a process P
+  which is accessing a Coda file.  It makes a system call which traps
+  to the OS kernel.  Examples of such calls trapping to the kernel are
+  ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``,
+  ``rmdir``, ``chmod`` in a Unix context.  Similar calls exist in the
+  Win32 environment, and are named ``CreateFile``.
+
+ Generally the operating system handles the request in a virtual
+ filesystem (VFS) layer, which is named I/O Manager in NT and IFS
+ manager in Windows 95. The VFS is responsible for partial processing
+ of the request and for locating the specific filesystem(s) which will
+ service parts of the request. Usually the information in the path
+ assists in locating the correct FS drivers. Sometimes after extensive
+ pre-processing, the VFS starts invoking exported routines in the FS
+ driver. This is the point where the FS specific processing of the
+ request starts, and here the Coda specific kernel code comes into
+ play.
+
+ The FS layer for Coda must expose and implement several interfaces.
+ First and foremost the VFS must be able to make all necessary calls to
+ the Coda FS layer, so the Coda FS driver must expose the VFS interface
+ as applicable in the operating system. These differ very significantly
+ among operating systems, but share features such as facilities to
+ read/write and create and remove objects. The Coda FS layer services
+ such VFS requests by invoking one or more well defined services
+ offered by the cache manager Venus. When the replies from Venus have
+ come back to the FS driver, servicing of the VFS call continues and
+ finishes with a reply to the kernel's VFS. Finally the VFS layer
+ returns to the process.
+
+ As a result of this design a basic interface exposed by the FS driver
+ must allow Venus to manage message traffic. In particular Venus must
+ be able to retrieve and place messages and to be notified of the
+ arrival of a new message. The notification must be through a mechanism
+ which does not block Venus since Venus must attend to other tasks even
+ when no messages are waiting or being processed.
+
+ **Interfaces of the Coda FS Driver**
+
+ Furthermore the FS layer provides for a special path of communication
+ between a user process and Venus, called the pioctl interface. The
+ pioctl interface is used for Coda specific services, such as
+ requesting detailed information about the persistent cache managed by
+ Venus. Here the involvement of the kernel is minimal. It identifies
+ the calling process and passes the information on to Venus. When
+ Venus replies the response is passed back to the caller in unmodified
+ form.
+
+ Finally Venus allows the kernel FS driver to cache the results from
+ certain services. This is done to avoid excessive context switches
+ and results in an efficient system. However, Venus may acquire
+  information, for example from the network, which implies that cached
+ information must be flushed or replaced. Venus then makes a downcall
+ to the Coda FS layer to request flushes or updates in the cache. The
+ kernel FS driver handles such requests synchronously.
+
+ Among these interfaces the VFS interface and the facility to place,
+ receive and be notified of messages are platform specific. We will
+ not go into the calls exported to the VFS layer but we will state the
+ requirements of the message exchange mechanism.
+
+
+3. The message layer
+=====================
+
+ At the lowest level the communication between Venus and the FS driver
+ proceeds through messages. The synchronization between processes
+ requesting Coda file service and Venus relies on blocking and waking
+ up processes. The Coda FS driver processes VFS- and pioctl-requests
+ on behalf of a process P, creates messages for Venus, awaits replies
+ and finally returns to the caller. The implementation of the exchange
+ of messages is platform specific, but the semantics have (so far)
+ appeared to be generally applicable. Data buffers are created by the
+ FS Driver in kernel memory on behalf of P and copied to user memory in
+ Venus.
+
+ The FS Driver while servicing P makes upcalls to Venus. Such an
+ upcall is dispatched to Venus by creating a message structure. The
+ structure contains the identification of P, the message sequence
+ number, the size of the request and a pointer to the data in kernel
+ memory for the request. Since the data buffer is re-used to hold the
+ reply from Venus, there is a field for the size of the reply. A flags
+ field is used in the message to precisely record the status of the
+ message. Additional platform dependent structures involve pointers to
+ determine the position of the message on queues and pointers to
+ synchronization objects. In the upcall routine the message structure
+ is filled in, flags are set to 0, and it is placed on the *pending*
+ queue. The routine calling upcall is responsible for allocating the
+ data buffer; its structure will be described in the next section.
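+
+  A simplified sketch of such a message structure is shown below; the
+  field names are illustrative, not the exact definitions used by any
+  particular port::
+
+      struct upc_msg {
+              struct list_head   chain;    /* position on pending/processing queue */
+              wait_queue_head_t  sleep;    /* where the calling thread of P sleeps */
+              int                pid;      /* identification of P */
+              int                unique;   /* message sequence number */
+              int                in_size;  /* size of the request */
+              int                out_size; /* size of the reply (buffer is re-used) */
+              int                flags;    /* 0, READ or WRITTEN */
+              void              *data;     /* request/reply buffer in kernel memory */
+      };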
+
+ A facility must exist to notify Venus that the message has been
+  created; this is implemented using available synchronization objects in
+ the OS. This notification is done in the upcall context of the process
+ P. When the message is on the pending queue, process P cannot proceed
+ in upcall. The (kernel mode) processing of P in the filesystem
+ request routine must be suspended until Venus has replied. Therefore
+ the calling thread in P is blocked in upcall. A pointer in the
+ message structure will locate the synchronization object on which P is
+ sleeping.
+
+ Venus detects the notification that a message has arrived, and the FS
+  driver allows Venus to retrieve the message with a getmsg_from_kernel
+ call. This action finishes in the kernel by putting the message on the
+ queue of processing messages and setting flags to READ. Venus is
+ passed the contents of the data buffer. The getmsg_from_kernel call
+ now returns and Venus processes the request.
+
+ At some later point the FS driver receives a message from Venus,
+ namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
+ driver looks at the contents of the message and decides if:
+
+
+ * the message is a reply for a suspended thread P. If so it removes
+ the message from the processing queue and marks the message as
+ WRITTEN. Finally, the FS driver unblocks P (still in the kernel
+ mode context of Venus) and the sendmsg_to_kernel call returns to
+ Venus. The process P will be scheduled at some point and continues
+ processing its upcall with the data buffer replaced with the reply
+ from Venus.
+
+ * The message is a ``downcall``. A downcall is a request from Venus to
+ the FS Driver. The FS driver processes the request immediately
+ (usually a cache eviction or replacement) and when it finishes
+ sendmsg_to_kernel returns.
+
+ Now P awakes and continues processing upcall. There are some
+ subtleties to take account of. First P will determine if it was woken
+ up in upcall by a signal from some other source (for example an
+ attempt to terminate P) or as is normally the case by Venus in its
+ sendmsg_to_kernel call. In the normal case, the upcall routine will
+ deallocate the message structure and return. The FS routine can proceed
+ with its processing.
+
+
+ **Sleeping and IPC arrangements**
+
+ In case P is woken up by a signal and not by Venus, it will first look
+ at the flags field. If the message is not yet READ, the process P can
+ handle its signal without notifying Venus. If Venus has READ, and
+ the request should not be processed, P can send Venus a signal message
+ to indicate that it should disregard the previous message. Such
+ signals are put in the queue at the head, and read first by Venus. If
+ the message is already marked as WRITTEN it is too late to stop the
+ processing. The VFS routine will now continue. (-- If a VFS request
+  involves more than one upcall, this can lead to complicated state; an
+  extra field "handle_signals" could be added to the message structure
+  to indicate that points of no return have been passed.--)
+
+
+
+3.1. Implementation details
+----------------------------
+
+ The Unix implementation of this mechanism has been through the
+ implementation of a character device associated with Coda. Venus
+ retrieves messages by doing a read on the device, replies are sent
+ with a write and notification is through the select system call on the
+ file descriptor for the device. The process P is kept waiting on an
+ interruptible wait queue object.
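+
+  As a hedged illustration of this arrangement, Venus's message loop on
+  the character device might look roughly as follows; the device path,
+  buffer size and error handling are all illustrative::
+
+      #include <fcntl.h>
+      #include <sys/select.h>
+      #include <unistd.h>
+
+      void venus_message_loop(void)
+      {
+              int fd = open("/dev/cfs0", O_RDWR);  /* Coda character device */
+              char buf[4096];                      /* illustrative size */
+              fd_set rfds;
+
+              for (;;) {
+                      FD_ZERO(&rfds);
+                      FD_SET(fd, &rfds);
+                      /* notification: wait for an upcall to arrive */
+                      select(fd + 1, &rfds, NULL, NULL, NULL);
+                      /* retrieve the message (getmsg_from_kernel) */
+                      ssize_t n = read(fd, buf, sizeof(buf));
+                      if (n <= 0)
+                              continue;
+                      /* ... service the upcall, build the reply in buf ... */
+                      /* send the reply (sendmsg_to_kernel) */
+                      write(fd, buf, n);
+              }
+      }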
+
+ In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
+ call is used. The DeviceIoControl call is designed to copy buffers
+ from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
+ is issued as a synchronous call, while the getmsg_from_kernel call is
+ asynchronous. Windows EventObjects are used for notification of
+ message arrival. The process P is kept waiting on a KernelEvent
+ object in NT and a semaphore in Windows 95.
+
+
+4. The interface at the call level
+===================================
+
+
+ This section describes the upcalls a Coda FS driver can make to Venus.
+  Each of these upcalls makes use of two structures: inputArgs and
+ outputArgs. In pseudo BNF form the structures take the following
+ form::
+
+
+ struct inputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_short pid; /* Common to all */
+ u_short pgid; /* Common to all */
+ struct CodaCred cred; /* Common to all */
+
+ <union "in" of call dependent parts of inputArgs>
+ };
+
+ struct outputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_long result;
+
+            <union "out" of call dependent parts of outputArgs>
+ };
+
+
+
+ Before going on let us elucidate the role of the various fields. The
+ inputArgs start with the opcode which defines the type of service
+ requested from Venus. There are approximately 30 upcalls at present
+  which we will discuss.  The unique field labels the inputArg with a
+  number that identifies the message uniquely.  A process and
+ process group id are passed. Finally the credentials of the caller
+ are included.
+
+ Before delving into the specific calls we need to discuss a variety of
+ data structures shared by the kernel and Venus.
+
+
+
+
+4.1. Data structures shared by the kernel and Venus
+----------------------------------------------------
+
+
+ The CodaCred structure defines a variety of user and group ids as
+ they are set for the calling process. The vuid_t and vgid_t are 32 bit
+ unsigned integers. It also defines group membership in an array. On
+ Unix the CodaCred has proven sufficient to implement good security
+ semantics for Coda but the structure may have to undergo modification
+ for the Windows environment when these mature::
+
+ struct CodaCred {
+ vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */
+ vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
+ vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
+ };
+
+
+ .. Note::
+
+ It is questionable if we need CodaCreds in Venus. Finally Venus
+ doesn't know about groups, although it does create files with the
+ default uid/gid. Perhaps the list of group membership is superfluous.
+
+
+ The next item is the fundamental identifier used to identify Coda
+ files, the ViceFid. A fid of a file uniquely defines a file or
+ directory in the Coda filesystem within a cell [1]_::
+
+ typedef struct ViceFid {
+ VolumeId Volume;
+ VnodeId Vnode;
+ Unique_t Unique;
+ } ViceFid;
+
+  .. [1] A cell is a group of Coda servers acting under the aegis of a single
+ system control machine or SCM. See the Coda Administration manual
+ for a detailed description of the role of the SCM.
+
+  Each of the constituent fields VolumeId, VnodeId and Unique_t is an
+  unsigned 32 bit integer.  We envisage that a further field will need
+  to be prefixed to identify the Coda cell; this will probably take the
+  form of an IPv6-sized IP address naming the Coda cell through DNS.
+
+ The next important structure shared between Venus and the kernel is
+ the attributes of the file. The following structure is used to
+ exchange information. It has room for future extensions such as
+ support for device files (currently not present in Coda)::
+
+
+ struct coda_timespec {
+ int64_t tv_sec; /* seconds */
+ long tv_nsec; /* nanoseconds */
+ };
+
+ struct coda_vattr {
+ enum coda_vtype va_type; /* vnode type (for create) */
+ u_short va_mode; /* files access mode and type */
+ short va_nlink; /* number of references to file */
+ vuid_t va_uid; /* owner user id */
+ vgid_t va_gid; /* owner group id */
+ long va_fsid; /* file system id (dev for now) */
+ long va_fileid; /* file id */
+ u_quad_t va_size; /* file size in bytes */
+ long va_blocksize; /* blocksize preferred for i/o */
+ struct coda_timespec va_atime; /* time of last access */
+ struct coda_timespec va_mtime; /* time of last modification */
+ struct coda_timespec va_ctime; /* time file changed */
+ u_long va_gen; /* generation number of file */
+ u_long va_flags; /* flags defined for file */
+ dev_t va_rdev; /* device special file represents */
+ u_quad_t va_bytes; /* bytes of disk space held by file */
+ u_quad_t va_filerev; /* file modification number */
+ u_int va_vaflags; /* operations flags, see below */
+ long va_spare; /* remain quad aligned */
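+
+  As an illustration, a minicache entry recording the four items above
+  could be sketched as follows; the field names are illustrative, not
+  the exact definitions of any port::
+
+      struct mc_entry {
+              struct list_head  hash;    /* chain in the name cache */
+              char             *name;    /* 1. the name of the file */
+              struct cnode     *dir;     /* 2. cnode of the containing directory */
+              struct CodaCred  *creds;   /* 3. CodaCred's permitted to use entry */
+              struct cnode     *cnode;   /* 4. cnode of the object */
+      };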
+ };
+
+
+4.2. The pioctl interface
+--------------------------
+
+
+  Coda specific requests can be made by applications through the pioctl
+  interface.  The pioctl is implemented as an ordinary ioctl on a
+ fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
+ a file handle and makes the ioctl call. Finally it closes the file.
+
+ The kernel involvement in this is limited to providing the facility to
+ open and close and pass the ioctl message and to verify that a path in
+ the pioctl data buffers is a file in a Coda filesystem.
+
+ The kernel is handed a data packet of the form::
+
+ struct {
+ const char *path;
+ struct ViceIoctl vidata;
+ int follow;
+ } data;
+
+
+
+ where::
+
+
+ struct ViceIoctl {
+ caddr_t in, out; /* Data to be transferred in, or out */
+ short in_size; /* Size of input buffer <= 2K */
+ short out_size; /* Maximum size of output buffer, <= 2K */
+ };
+
+
+
+ The path must be a Coda file, otherwise the ioctl upcall will not be
+ made.
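+
+  As a hedged sketch of how an application might drive this interface
+  (CODA_PIOCTL_CMD stands in for the Coda-specific ioctl command
+  number; the path and buffer sizes are illustrative)::
+
+      #include <fcntl.h>
+      #include <sys/ioctl.h>
+      #include <unistd.h>
+
+      int pioctl_example(void)
+      {
+              char inbuf[2048], outbuf[2048];
+              struct {
+                      const char *path;
+                      struct ViceIoctl vidata;    /* as defined above */
+                      int follow;
+              } data = {
+                      .path   = "/coda/somefile",
+                      .vidata = { inbuf, outbuf, sizeof(inbuf), sizeof(outbuf) },
+                      .follow = 1,
+              };
+
+              int fd = open("/coda/.CONTROL", O_RDONLY);
+              int ret = ioctl(fd, CODA_PIOCTL_CMD, &data);
+              close(fd);
+              return ret;
+      }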
+
+ .. Note:: The data structures and code are a mess. We need to clean this up.
+
+
+**We now proceed to document the individual calls**:
+
+
+4.3. root
+----------
+
+
+ Arguments
+ in
+
+ empty
+
+ out::
+
+ struct cfs_root_out {
+ ViceFid VFid;
+ } cfs_root;
+
+
+
+ Description
+ This call is made to Venus during the initialization of
+ the Coda filesystem. If the result is zero, the cfs_root structure
+ contains the ViceFid of the root of the Coda filesystem. If a non-zero
+ result is generated, its value is a platform dependent error code
+ indicating the difficulty Venus encountered in locating the root of
+ the Coda filesystem.
+
+4.4. lookup
+------------
+
+
+ Summary
+ Find the ViceFid and type of an object in a directory if it exists.
+
+ Arguments
+ in::
+
+ struct cfs_lookup_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_lookup;
+
+
+
+ out::
+
+ struct cfs_lookup_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_lookup;
+
+
+
+ Description
+ This call is made to determine the ViceFid and filetype of
+ a directory entry. The directory entry requested carries name 'name'
+ and Venus will search the directory identified by cfs_lookup_in.VFid.
+ The result may indicate that the name does not exist, or that
+ difficulty was encountered in finding it (e.g. due to disconnection).
+ If the result is zero, the field cfs_lookup_out.VFid contains the
+  target's ViceFid and cfs_lookup_out.vtype the coda_vtype giving the
+ type of object the name designates.
+
+ The name of the object is an 8 bit character string of maximum length
+  CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator).
+
+  It is extremely important to realize that Venus bitwise ORs the field
+ cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should
+ not be put in the kernel name cache.
+
+ .. Note::
+
+ The type of the vtype is currently wrong. It should be
+ coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
+
+
+4.5. getattr
+-------------
+
+
+  Summary
+    Get the attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_getattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr; /* XXXXX */
+ } cfs_getattr;
+
+
+
+ out::
+
+ struct cfs_getattr_out {
+ struct coda_vattr attr;
+ } cfs_getattr;
+
+
+
+ Description
+ This call returns the attributes of the file identified by fid.
+
+ Errors
+ Errors can occur if the object with fid does not exist, is
+  inaccessible or if the caller does not have permission to fetch
+ attributes.
+
+ .. Note::
+
+ Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
+ the attributes as well as the Fid for the instantiation of an internal
+ "inode" or "FileHandle". A significant improvement in performance on
+ such systems could be made by combining the lookup and getattr calls
+ both at the Venus/kernel interaction level and at the RPC level.
+
+ The vattr structure included in the input arguments is superfluous and
+ should be removed.
+
+
+4.6. setattr
+-------------
+
+
+ Summary
+ Set the attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_setattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_setattr;
+
+
+
+
+ out
+
+ empty
+
+ Description
+ The structure attr is filled with attributes to be changed
+ in BSD style. Attributes not to be changed are set to -1, apart from
+  vtype which is set to VNON.  Others are set to the values to be assigned.
+ The only attributes which the FS driver may request to change are the
+ mode, owner, groupid, atime, mtime and ctime. The return value
+ indicates success or failure.
+
+ Errors
+ A variety of errors can occur. The object may not exist, may
+ be inaccessible, or permission may not be granted by Venus.
+
+
+4.7. access
+------------
+
+
+ Arguments
+ in::
+
+ struct cfs_access_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_access;
+
+
+
+ out
+
+ empty
+
+ Description
+ Verify if access to the object identified by VFid for
+ operations described by flags is permitted. The result indicates if
+ access will be granted. It is important to remember that Coda uses
+ ACLs to enforce protection and that ultimately the servers, not the
+  clients, enforce the security of the system.  The result of this call
+ will depend on whether a token is held by the user.
+
+ Errors
+ The object may not exist, or the ACL describing the protection
+ may not be accessible.
+
+
+4.8. create
+------------
+
+
+ Summary
+ Invoked to create a file
+
+ Arguments
+ in::
+
+ struct cfs_create_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ int excl;
+ int mode;
+ char *name; /* Place holder for data. */
+ } cfs_create;
+
+
+
+
+ out::
+
+ struct cfs_create_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_create;
+
+
+
+ Description
+ This upcall is invoked to request creation of a file.
+ The file will be created in the directory identified by VFid, its name
+ will be name, and the mode will be mode. If excl is set an error will
+ be returned if the file already exists. If the size field in attr is
+ set to zero the file will be truncated. The uid and gid of the file
+ are set by converting the CodaCred to a uid using a macro CRTOUID
+ (this macro is platform dependent). Upon success the VFid and
+ attributes of the file are returned. The Coda FS Driver will normally
+ instantiate a vnode, inode or file handle at kernel level for the new
+ object.
+
+
+ Errors
+ A variety of errors can occur. Permissions may be insufficient.
+ If the object exists and is not a file the error EISDIR is returned
+ under Unix.
+
+ .. Note::
+
+ The packing of parameters is very inefficient and appears to
+ indicate confusion between the system call creat and the VFS operation
+ create. The VFS operation create is only called to create new objects.
+ This create call differs from the Unix one in that it is not invoked
+ to return a file descriptor. The truncate and exclusive options,
+ together with the mode, could simply be part of the mode as it is
+  under Unix.  There should be no flags argument; this is used in
+  open(2) to return a file descriptor for READ or WRITE mode.
+
+ The attributes of the directory should be returned too, since the size
+ and mtime changed.
+
+
+4.9. mkdir
+-----------
+
+
+ Summary
+ Create a new directory.
+
+ Arguments
+ in::
+
+ struct cfs_mkdir_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ char *name; /* Place holder for data. */
+ } cfs_mkdir;
+
+
+
+ out::
+
+ struct cfs_mkdir_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_mkdir;
+
+
+
+
+ Description
+ This call is similar to create but creates a directory.
+ Only the mode field in the input parameters is used for creation.
+ Upon successful creation, the attr returned contains the attributes of
+ the new directory.
+
+ Errors
+ As for create.
+
+ .. Note::
+
+ The input parameter should be changed to mode instead of
+ attributes.
+
+ The attributes of the parent should be returned since the size and
+  mtime change.
+
+
+4.10. link
+-----------
+
+
+ Summary
+ Create a link to an existing file.
+
+ Arguments
+ in::
+
+ struct cfs_link_in {
+ ViceFid sourceFid; /* cnode to link *to* */
+ ViceFid destFid; /* Directory in which to place link */
+ char *tname; /* Place holder for data. */
+ } cfs_link;
+
+
+
+ out
+
+ empty
+
+ Description
+ This call creates a link to the sourceFid in the directory
+ identified by destFid with name tname. The source must reside in the
+  target's parent, i.e. the source must have destFid as its parent; Coda
+  does not support cross-directory hard links.  Only the return value is
+ relevant. It indicates success or the type of failure.
+
+ Errors
+ The usual errors can occur.
+
+
+4.11. symlink
+--------------
+
+
+ Summary
+ create a symbolic link
+
+ Arguments
+ in::
+
+ struct cfs_symlink_in {
+ ViceFid VFid; /* Directory to put symlink in */
+ char *srcname;
+ struct coda_vattr attr;
+ char *tname;
+ } cfs_symlink;
+
+
+
+ out
+
+ none
+
+ Description
+ Create a symbolic link. The link is to be placed in the
+ directory identified by VFid and named tname. It should point to the
+ pathname srcname. The attributes of the newly created object are to
+ be set to attr.
+
+ .. Note::
+
+ The attributes of the target directory should be returned since
+ its size changed.
+
+
+4.12. remove
+-------------
+
+
+ Summary
+ Remove a file
+
+ Arguments
+ in::
+
+ struct cfs_remove_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_remove;
+
+
+
+ out
+
+ none
+
+ Description
+ Remove file named cfs_remove_in.name in directory
+ identified by VFid.
+
+
+ .. Note::
+
+ The attributes of the directory should be returned since its
+ mtime and size may change.
+
+
+4.13. rmdir
+------------
+
+
+ Summary
+ Remove a directory
+
+ Arguments
+ in::
+
+ struct cfs_rmdir_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_rmdir;
+
+
+
+ out
+
+ none
+
+ Description
+ Remove the directory with name 'name' from the directory
+ identified by VFid.
+
+ .. Note:: The attributes of the parent directory should be returned since
+ its mtime and size may change.
+
+
+4.14. readlink
+---------------
+
+
+ Summary
+ Read the value of a symbolic link.
+
+ Arguments
+ in::
+
+ struct cfs_readlink_in {
+ ViceFid VFid;
+ } cfs_readlink;
+
+
+
+ out::
+
+ struct cfs_readlink_out {
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readlink;
+
+
+
+ Description
+  This routine reads the contents of the symbolic link
+ identified by VFid into the buffer data. The buffer data must be able
+ to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
+
+ Errors
+ No unusual errors.
+
+
+4.15. open
+-----------
+
+
+ Summary
+ Open a file.
+
+ Arguments
+ in::
+
+ struct cfs_open_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_open;
+
+
+
+ out::
+
+ struct cfs_open_out {
+ dev_t dev;
+ ino_t inode;
+ } cfs_open;
+
+
+
+ Description
+ This request asks Venus to place the file identified by
+ VFid in its cache and to note that the calling process wishes to open
+ it with flags as in open(2). The return value to the kernel differs
+ for Unix and Windows systems. For Unix systems the Coda FS Driver is
+ informed of the device and inode number of the container file in the
+ fields dev and inode. For Windows the path of the container file is
+ returned to the kernel.
+
+
+ .. Note::
+
+ Currently the cfs_open_out structure is not properly adapted to
+ deal with the Windows case. It might be best to implement two
+ upcalls, one to open aiming at a container file name, the other at a
+ container file inode.
+
+
+4.16. close
+------------
+
+
+ Summary
+ Close a file, update it on the servers.
+
+ Arguments
+ in::
+
+ struct cfs_close_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_close;
+
+
+
+ out
+
+ none
+
+ Description
+ Close the file identified by VFid.
+
+ .. Note::
+
+ The flags argument is bogus and not used. However, Venus' code
+  has room to deal with an execp input field; probably this field should
+ be used to inform Venus that the file was closed but is still memory
+ mapped for execution. There are comments about fetching versus not
+ fetching the data in Venus vproc_vfscalls. This seems silly. If a
+ file is being closed, the data in the container file is to be the new
+ data. Here again the execp flag might be in play to create confusion:
+ currently Venus might think a file can be flushed from the cache when
+ it is still memory mapped. This needs to be understood.
+
+
+4.17. ioctl
+------------
+
+
+ Summary
+ Do an ioctl on a file. This includes the pioctl interface.
+
+ Arguments
+ in::
+
+ struct cfs_ioctl_in {
+ ViceFid VFid;
+ int cmd;
+ int len;
+ int rwflag;
+ char *data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ out::
+
+
+ struct cfs_ioctl_out {
+ int len;
+ caddr_t data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ Description
+ Do an ioctl operation on a file. The command, len and
+ data arguments are filled as usual. flags is not used by Venus.
+
+ .. Note::
+
+ Another bogus parameter. flags is not used. What is the
+ business about PREFETCHING in the Venus code?
+
+
+
+4.18. rename
+-------------
+
+
+ Summary
+ Rename a fid.
+
+ Arguments
+ in::
+
+ struct cfs_rename_in {
+ ViceFid sourceFid;
+ char *srcname;
+ ViceFid destFid;
+ char *destname;
+ } cfs_rename;
+
+
+
+ out
+
+ none
+
+ Description
+ Rename the object with name srcname in directory
+ sourceFid to destname in destFid. It is important that the names
+  srcname and destname are null-terminated strings, since strings in
+  Unix kernels are not always null terminated.
+
+
+4.19. readdir
+--------------
+
+
+ Summary
+ Read directory entries.
+
+ Arguments
+ in::
+
+ struct cfs_readdir_in {
+ ViceFid VFid;
+ int count;
+ int offset;
+ } cfs_readdir;
+
+
+
+
+ out::
+
+ struct cfs_readdir_out {
+ int size;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readdir;
+
+
+
+ Description
+ Read directory entries from VFid starting at offset and
+  read at most count bytes.  Returns the data in data and the size
+  in size.
+
+
+ .. Note::
+
+ This call is not used. Readdir operations exploit container
+ files. We will re-evaluate this during the directory revamp which is
+ about to take place.
+
+
+4.20. vget
+-----------
+
+
+ Summary
+ instructs Venus to do an FSDB->Get.
+
+ Arguments
+ in::
+
+ struct cfs_vget_in {
+ ViceFid VFid;
+ } cfs_vget;
+
+
+
+ out::
+
+ struct cfs_vget_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_vget;
+
+
+
+ Description
+ This upcall asks Venus to do a get operation on an fsobj
+ labelled by VFid.
+
+ .. Note::
+
+ This operation is not used. However, it is extremely useful
+ since it can be used to deal with read/write memory mapped files.
+ These can be "pinned" in the Venus cache using vget and released with
+ inactive.
+
+
+4.21. fsync
+------------
+
+
+ Summary
+ Tell Venus to update the RVM attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_fsync_in {
+ ViceFid VFid;
+ } cfs_fsync;
+
+
+
+ out
+
+ none
+
+ Description
+ Ask Venus to update RVM attributes of object VFid. This
+ should be called as part of kernel level fsync type calls. The
+ result indicates if the syncing was successful.
+
+ .. Note:: Linux does not implement this call. It should.
+
+
+4.22. inactive
+---------------
+
+
+ Summary
+ Tell Venus a vnode is no longer in use.
+
+ Arguments
+ in::
+
+ struct cfs_inactive_in {
+ ViceFid VFid;
+ } cfs_inactive;
+
+
+
+ out
+
+ none
+
+ Description
+ This operation returns EOPNOTSUPP.
+
+ .. Note:: This should perhaps be removed.
+
+
+4.23. rdwr
+-----------
+
+
+ Summary
+ Read or write from a file
+
+ Arguments
+ in::
+
+ struct cfs_rdwr_in {
+ ViceFid VFid;
+ int rwflag;
+ int count;
+ int offset;
+ int ioflag;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+
+ out::
+
+ struct cfs_rdwr_out {
+ int rwflag;
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+ Description
+ This upcall asks Venus to read or write from a file.
+
+
+ .. Note::
+
+ It should be removed since it is against the Coda philosophy that
+ read/write operations never reach Venus. I have been told the
+ operation does not work. It is not currently used.
+
+
+
+4.24. odymount
+---------------
+
+
+ Summary
+ Allows mounting multiple Coda "filesystems" on one Unix mount point.
+
+ Arguments
+ in::
+
+ struct ody_mount_in {
+ char *name; /* Place holder for data. */
+ } ody_mount;
+
+
+
+ out::
+
+ struct ody_mount_out {
+ ViceFid VFid;
+ } ody_mount;
+
+
+
+ Description
+ Asks Venus to return the rootfid of a Coda system named
+ name. The fid is returned in VFid.
+
+ .. Note::
+
+ This call was used by David for dynamic sets. It should be
+ removed since it causes a jungle of pointers in the VFS mounting area.
+ It is not used by Coda proper. Call is not implemented by Venus.
+
+
+4.25. ody_lookup
+-----------------
+
+
+ Summary
+ Looks up something.
+
+ Arguments
+ in
+
+ irrelevant
+
+
+ out
+
+ irrelevant
+
+
+ .. Note:: Gut it. Call is not implemented by Venus.
+
+
+4.26. ody_expand
+-----------------
+
+
+ Summary
+ expands something in a dynamic set.
+
+ Arguments
+ in
+
+ irrelevant
+
+ out
+
+ irrelevant
+
+ .. Note:: Gut it. Call is not implemented by Venus.
+
+
+4.27. prefetch
+---------------
+
+
+ Summary
+ Prefetch a dynamic set.
+
+ Arguments
+
+ in
+
+ Not documented.
+
+ out
+
+ Not documented.
+
+ Description
+ Venus worker.cc has support for this call, although it is
+ noted that it doesn't work. Not surprising, since the kernel does not
+ have support for it. (ODY_PREFETCH is not a defined operation).
+
+
+ .. Note:: Gut it. It isn't working and isn't used by Coda.
+
+
+
+4.28. signal
+-------------
+
+
+ Summary
+ Send Venus a signal about an upcall.
+
+ Arguments
+ in
+
+ none
+
+ out
+
+ not applicable.
+
+ Description
+ This is an out-of-band upcall to Venus to inform Venus
+ that the calling process received a signal after Venus read the
+ message from the input queue. Venus is supposed to clean up the
+ operation.
+
+ Errors
+ No reply is given.
+
+ .. Note::
+
+ We need to better understand what Venus needs to clean up and if
+  it is doing this correctly.  Also we need to handle situations with
+  multiple upcalls per system call correctly.  It would be important to know
+ what state changes in Venus take place after an upcall for which the
+ kernel is responsible for notifying Venus to clean up (e.g. open
+ definitely is such a state change, but many others are maybe not).
+
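+   A sketch of the driver side, again with hypothetical queueing
+   helpers, assuming the signal message carries the unique of the
+   aborted upcall and that the opcode name CFS_SIGNAL follows the
+   pattern of the other calls; note that no reply is awaited::
+
+       static void coda_signal_upcall(u_long unique)
+       {
+               struct inputArgs *inp;
+
+               inp = alloc_upcall(CFS_SIGNAL, sizeof(*inp));
+               if (!inp)
+                       return;         /* nothing more we can do */
+               inp->unique = unique;   /* upcall being aborted */
+
+               /* Signal messages go at the head of the queue and
+                  receive no reply from Venus (see section 3). */
+               queue_add_head(inp);
+               notify_venus();
+       }
+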
+
+5. The minicache and downcalls
+===============================
+
+
+ The Coda FS Driver can cache results of lookup and access upcalls, to
+ limit the frequency of upcalls. Upcalls carry a price since a process
+ context switch needs to take place. The counterpart of caching the
+ information is that Venus will notify the FS Driver that cached
+ entries must be flushed or renamed.
+
+ The kernel code generally has to maintain a structure which links the
+ internal file handles (called vnodes in BSD, inodes in Linux and
+ FileHandles in Windows) with the ViceFid's which Venus maintains. The
+ reason is that frequent translations back and forth are needed in
+ order to make upcalls and use the results of upcalls. Such linking
+ objects are called cnodes.
+
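+   On Linux, for instance, such a linking object might look roughly
+   like the sketch below; the field names are illustrative, and each
+   port uses its own native handle type::
+
+       struct cnode {
+               struct inode    *c_inode;       /* native kernel handle */
+               ViceFid         c_fid;          /* identifier used by Venus */
+               int             c_flags;        /* e.g. validity of cached data */
+       };
+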
+ The current minicache implementations have cache entries which record
+ the following:
+
+ 1. the name of the file
+
+ 2. the cnode of the directory containing the object
+
+ 3. a list of CodaCred's for which the lookup is permitted.
+
+ 4. the cnode of the object
+
+ The lookup call in the Coda FS Driver may request the cnode of the
+ desired object from the cache, by passing its name, directory and the
+ CodaCred's of the caller. The cache will return the cnode or indicate
+ that it cannot be found. The Coda FS Driver must be careful to
+ invalidate cache entries when it modifies or removes objects.
+
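+   As an illustration, a cache entry recording the four items above and
+   its lookup routine might look as follows; all names are hypothetical
+   and not taken from any particular port::
+
+       struct cfs_cache_entry {
+               char            *name;          /* 1. name of the file */
+               struct cnode    *dir;           /* 2. cnode of the parent dir */
+               struct CodaCred *creds;         /* 3. creds allowed to look up */
+               int             ncreds;
+               struct cnode    *cnode;         /* 4. cnode of the object */
+       };
+
+       /* Return the cnode, or NULL if nothing is cached for these
+          credentials; the caller then falls back to a lookup upcall. */
+       struct cnode *cfs_cache_lookup(struct cnode *dir, const char *name,
+                                      struct CodaCred *cred)
+       {
+               struct cfs_cache_entry *e;
+
+               for (e = cache_first(); e; e = cache_next(e))
+                       if (e->dir == dir && strcmp(e->name, name) == 0 &&
+                           cred_listed(e->creds, e->ncreds, cred))
+                               return e->cnode;
+               return NULL;
+       }
+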
+ When Venus obtains information that indicates that cache entries are
+ no longer valid, it will make a downcall to the kernel. Downcalls are
+ intercepted by the Coda FS Driver and lead to cache invalidations of
+ the kind described below. The Coda FS Driver does not return an error
+ unless the downcall data could not be read into kernel memory.
+
+
+5.1. INVALIDATE
+----------------
+
+
+ No information is available on this call.
+
+
+5.2. FLUSH
+-----------
+
+
+
+ Arguments
+ None
+
+ Summary
+ Flush the name cache entirely.
+
+ Description
+     Venus issues this call upon startup and when it dies. This
+     is to prevent stale cache information from being held. Some
+     operating systems allow the kernel name cache to be switched off
+     dynamically. When this is done, this downcall is made.
+
+
+5.3. PURGEUSER
+---------------
+
+
+ Arguments
+ ::
+
+ struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
+ struct CodaCred cred;
+ } cfs_purgeuser;
+
+
+
+ Description
+ Remove all entries in the cache carrying the Cred. This
+ call is issued when tokens for a user expire or are flushed.
+
+
+5.4. ZAPFILE
+-------------
+
+
+ Arguments
+ ::
+
+ struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapfile;
+
+
+
+ Description
+ Remove all entries which have the (dir vnode, name) pair.
+ This is issued as a result of an invalidation of cached attributes of
+ a vnode.
+
+ .. Note::
+
+ Call is not named correctly in NetBSD and Mach. The minicache
+ zapfile routine takes different arguments. Linux does not implement
+ the invalidation of attributes correctly.
+
+
+
+5.5. ZAPDIR
+------------
+
+
+ Arguments
+ ::
+
+ struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapdir;
+
+
+
+ Description
+ Remove all entries in the cache lying in a directory
+ CodaFid, and all children of this directory. This call is issued when
+ Venus receives a callback on the directory.
+
+
+5.6. ZAPVNODE
+--------------
+
+
+
+ Arguments
+ ::
+
+ struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
+ struct CodaCred cred;
+ ViceFid VFid;
+ } cfs_zapvnode;
+
+
+
+ Description
+ Remove all entries in the cache carrying the cred and VFid
+ as in the arguments. This downcall is probably never issued.
+
+
+5.7. PURGEFID
+--------------
+
+
+ Arguments
+ ::
+
+ struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_purgefid;
+
+
+
+ Description
+     Flush the attributes for the file. If it is a directory (odd
+     vnode), purge its children from the namecache and remove the file
+     itself from the namecache.
+
+
+
+5.8. REPLACE
+-------------
+
+
+ Summary
+ Replace the Fid's for a collection of names.
+
+ Arguments
+ ::
+
+ struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
+ ViceFid NewFid;
+ ViceFid OldFid;
+ } cfs_replace;
+
+
+
+ Description
+     This routine replaces a ViceFid in the name cache with
+     another. It was added to allow Venus, during reintegration, to
+     replace the temporary fids it allocated locally while disconnected
+     with global fids, even when the reference counts on those fids are
+     not zero.
+
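+   Taken together, the downcalls above can be served by a single
+   dispatcher in the FS driver. A rough sketch follows; the opcode
+   names are taken from the struct comments above (CFS_FLUSH is assumed
+   by analogy), the union-style field access is abbreviated, and the
+   cache helpers are hypothetical::
+
+       static int coda_downcall(int opcode, union outputArgs *out)
+       {
+               switch (opcode) {
+               case CFS_FLUSH:
+                       cache_flush_all();
+                       break;
+               case CFS_PURGEUSER:
+                       cache_purge_cred(&out->cfs_purgeuser.cred);
+                       break;
+               case CFS_ZAPFILE:
+                       cache_zap_file(&out->cfs_zapfile.CodaFid);
+                       break;
+               case CFS_ZAPDIR:
+                       cache_zap_dir(&out->cfs_zapdir.CodaFid);
+                       break;
+               case CFS_ZAPVNODE:
+                       cache_zap_vnode(&out->cfs_zapvnode.cred,
+                                       &out->cfs_zapvnode.VFid);
+                       break;
+               case CFS_PURGEFID:
+                       cache_purge_fid(&out->cfs_purgefid.CodaFid);
+                       break;
+               case CFS_REPLACE:
+                       cache_replace(&out->cfs_replace.OldFid,
+                                     &out->cfs_replace.NewFid);
+                       break;
+               default:        /* as stated above, only a failure to
+                                  read the downcall data is an error */
+                       break;
+               }
+               return 0;
+       }
+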
+
+6. Initialization and cleanup
+==============================
+
+
+ This section gives brief hints as to desirable features for the Coda
+ FS Driver at startup and upon shutdown or Venus failures. Before
+ entering the discussion it is useful to repeat that the Coda FS Driver
+ maintains the following data:
+
+
+ 1. message queues
+
+ 2. cnodes
+
+ 3. name cache entries
+
+ The name cache entries are entirely private to the driver, so they
+ can easily be manipulated. The message queues will generally have
+ clear points of initialization and destruction. The cnodes are
+ much more delicate. User processes hold reference counts in Coda
+ filesystems and it can be difficult to clean up the cnodes.
+
+   The driver can expect requests through:
+
+ 1. the message subsystem
+
+ 2. the VFS layer
+
+ 3. pioctl interface
+
+   Currently pioctl calls pass through the VFS for Coda, so we can
+   treat these cases similarly.
+
+
+6.1. Requirements
+------------------
+
+
+ The following requirements should be accommodated:
+
+   1. The message queues should have open and close routines. On Unix,
+      opening and closing the character device serve this purpose (see
+      the sketch after this list).
+
+ - Before opening, no messages can be placed.
+
+ - Opening will remove any old messages still pending.
+
+ - Close will notify any sleeping processes that their upcall cannot
+ be completed.
+
+ - Close will free all memory allocated by the message queues.
+
+
+ 2. At open the namecache shall be initialized to empty state.
+
+ 3. Before the message queues are open, all VFS operations will fail.
+      Fortunately this can be achieved by making sure that mounting the
+ Coda filesystem cannot succeed before opening.
+
+ 4. After closing of the queues, no VFS operations can succeed. Here
+ one needs to be careful, since a few operations (lookup,
+ read/write, readdir) can proceed without upcalls. These must be
+ explicitly blocked.
+
+ 5. Upon closing the namecache shall be flushed and disabled.
+
+ 6. All memory held by cnodes can be freed without relying on upcalls.
+
+ 7. Unmounting the file system can be done without relying on upcalls.
+
+ 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
+ get the rootfid or the attributes of the rootfid. The latter is
+ best implemented by Venus fetching these objects before attempting
+ to mount.
+
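+   Requirements 1, 2 and 5 suggest tying the queue and namecache
+   lifetimes to the character device. A rough sketch, with hypothetical
+   queue and cache helpers (the Linux file_operations signatures are
+   used only for concreteness)::
+
+       static int coda_psdev_open(struct inode *inode, struct file *file)
+       {
+               /* Opening removes any old messages still pending. */
+               queue_discard_all();
+               /* At open the namecache starts out empty. */
+               cache_flush_all();
+               cache_enable();
+               return 0;
+       }
+
+       static int coda_psdev_release(struct inode *inode, struct file *file)
+       {
+               /* Tell sleepers their upcalls cannot be completed,
+                  then free all memory held by the message queues. */
+               queue_abort_sleepers();
+               queue_free_all();
+               /* Upon closing, flush and disable the namecache. */
+               cache_flush_all();
+               cache_disable();
+               return 0;
+       }
+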
+ .. Note::
+
+     NetBSD in particular, but also Linux, has not implemented the
+     above requirements fully. For smooth operation this needs to be
+     corrected.
+
+
+
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
deleted file mode 100644
index 1711ad48e38a..000000000000
--- a/Documentation/filesystems/coda.txt
+++ /dev/null
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs.rst
index 16e606c11f40..ac22138de6a4 100644
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs.rst
@@ -1,5 +1,6 @@
-
-configfs - Userspace-driven kernel object configuration.
+=======================================================
+Configfs - Userspace-driven Kernel Object Configuration
+=======================================================
Joel Becker <joel.becker@oracle.com>
@@ -9,7 +10,8 @@ Copyright (c) 2005 Oracle Corporation,
Joel Becker <joel.becker@oracle.com>
-[What is configfs?]
+What is configfs?
+=================
configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
@@ -35,10 +37,11 @@ kernel modules backing the items must respond to this.
Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other.
-[Using configfs]
+Using configfs
+==============
configfs can be compiled as a module or into the kernel. You can access
-it by doing
+it by doing::
mount -t configfs none /config
@@ -56,28 +59,29 @@ values. Don't mix more than one attribute in one attribute file.
There are two types of configfs attributes:
* Normal attributes, which, similar to sysfs attributes, are small ASCII text
-files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
-only one value per file should be used, and the same caveats from sysfs apply.
-Configfs expects write(2) to store the entire buffer at once. When writing to
-normal configfs attributes, userspace processes should first read the entire
-file, modify the portions they wish to change, and then write the entire
-buffer back.
+ files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
+ only one value per file should be used, and the same caveats from sysfs apply.
+ Configfs expects write(2) to store the entire buffer at once. When writing to
+ normal configfs attributes, userspace processes should first read the entire
+ file, modify the portions they wish to change, and then write the entire
+ buffer back.
* Binary attributes, which are somewhat similar to sysfs binary attributes,
-but with a few slight changes to semantics. The PAGE_SIZE limitation does not
-apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
-The write(2) calls from user space are buffered, and the attributes'
-write_bin_attribute method will be invoked on the final close, therefore it is
-imperative for user-space to check the return code of close(2) in order to
-verify that the operation finished successfully.
-To avoid a malicious user OOMing the kernel, there's a per-binary attribute
-maximum buffer value.
+ but with a few slight changes to semantics. The PAGE_SIZE limitation does not
+  apply, but the whole binary item must fit in a single kernel vmalloc'ed buffer.
+ The write(2) calls from user space are buffered, and the attributes'
+ write_bin_attribute method will be invoked on the final close, therefore it is
+ imperative for user-space to check the return code of close(2) in order to
+ verify that the operation finished successfully.
+ To avoid a malicious user OOMing the kernel, there's a per-binary attribute
+ maximum buffer value.
When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2).
-[Configuring FakeNBD: an Example]
+Configuring FakeNBD: an Example
+===============================
Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
@@ -86,14 +90,14 @@ sysadmins use to configure FakeNBD, but somehow that program has to tell
the driver about it. Here's where configfs comes in.
When the FakeNBD driver is loaded, it registers itself with configfs.
-readdir(3) sees this just fine:
+readdir(3) sees this just fine::
# ls /config
fakenbd
A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps
-it is a uuid or a disk name:
+it is a uuid or a disk name::
# mkdir /config/fakenbd/disk1
# ls /config/fakenbd/disk1
@@ -102,7 +106,7 @@ it is a uuid or a disk name:
The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is
-read-only or read-write.
+read-only or read-write::
# echo 10.0.0.1 > /config/fakenbd/disk1/target
# echo /dev/sda1 > /config/fakenbd/disk1/device
@@ -111,7 +115,8 @@ read-only or read-write.
That's it. That's all there is. Now the device is configured, via the
shell no less.
-[Coding With configfs]
+Coding With configfs
+====================
Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that
@@ -130,7 +135,10 @@ appears as a directory at the top of the configfs filesystem. A
subsystem is also a config_group, and can do everything a config_group
can.
-[struct config_item]
+struct config_item
+==================
+
+::
struct config_item {
char *ci_name;
@@ -168,7 +176,10 @@ By itself, a config_item cannot do much more than appear in configfs.
Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type.
-[struct config_item_type]
+struct config_item_type
+=======================
+
+::
struct configfs_item_operations {
void (*release)(struct config_item *);
@@ -192,7 +203,10 @@ allocated dynamically will need to provide the ct_item_ops->release()
method. This method is called when the config_item's reference count
reaches zero.
-[struct configfs_attribute]
+struct configfs_attribute
+=========================
+
+::
struct configfs_attribute {
char *ca_name;
@@ -212,9 +226,12 @@ filename. configfs_attribute->ca_mode specifies the file permissions.
If an attribute is readable and provides a ->show method, that method will
be called whenever userspace asks for a read(2) on the attribute. If an
attribute is writable and provides a ->store method, that method will be
-be called whenever userspace asks for a write(2) on the attribute.
+called whenever userspace asks for a write(2) on the attribute.
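+
+As a minimal sketch modeled on samples/configfs/configfs_sample.c (the
+``childless`` names and the ``to_childless()`` container_of() helper come
+from that sample), the CONFIGFS_ATTR() helper wires such methods up::
+
+  static ssize_t childless_storeme_show(struct config_item *item, char *page)
+  {
+          return sprintf(page, "%d\n", to_childless(item)->storeme);
+  }
+
+  static ssize_t childless_storeme_store(struct config_item *item,
+                                         const char *page, size_t count)
+  {
+          struct childless *c = to_childless(item);
+          int ret = kstrtoint(page, 10, &c->storeme);
+
+          return ret ? ret : count;
+  }
+
+  CONFIGFS_ATTR(childless_, storeme);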
+
+struct configfs_bin_attribute
+=============================
-[struct configfs_bin_attribute]
+::
struct configfs_bin_attribute {
struct configfs_attribute cb_attr;
@@ -236,15 +253,16 @@ to be used.
If a binary attribute is readable and the config_item provides a
ct_item_ops->read_bin_attribute() method, that method will be called
whenever userspace asks for a read(2) on the attribute. The converse
-will happen for write(2). The reads/writes are bufferred so only a
+will happen for write(2). The reads/writes are buffered so only a
single read/write will occur; the attribute's handler need not concern itself
with it.
-[struct config_group]
+struct config_group
+===================
A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a
-child item.
+child item::
struct config_group {
struct config_item cg_item;
@@ -264,14 +282,13 @@ The config_group structure contains a config_item. Properly configuring
that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's
-config_item_type.
+config_item_type::
struct configfs_group_operations {
struct config_item *(*make_item)(struct config_group *group,
const char *name);
struct config_group *(*make_group)(struct config_group *group,
const char *name);
- int (*commit_item)(struct config_item *item);
void (*disconnect_notify)(struct config_group *group,
struct config_item *item);
void (*drop_item)(struct config_group *group,
@@ -279,7 +296,8 @@ config_item_type.
};
A group creates child items by providing the
-ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
+ct_group_ops->make_item() method. If provided, this method is called from
+mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item.
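+
+A sketch of such a method, following samples/configfs/configfs_sample.c
+(the ``simple_child`` names are illustrative)::
+
+  static struct config_item *simple_make_item(struct config_group *group,
+                                              const char *name)
+  {
+          struct simple_child *child;
+
+          child = kzalloc(sizeof(*child), GFP_KERNEL);
+          if (!child)
+                  return ERR_PTR(-ENOMEM);
+
+          /* bind the new item to its name and config_item_type */
+          config_item_init_type_name(&child->item, name, &simple_child_type);
+          return &child->item;
+  }
+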
@@ -296,13 +314,14 @@ upon item allocation. If a subsystem has no work to do, it may omit
the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem.
-IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
-is called, configfs WILL remove the item from the filesystem tree
-(assuming that it has no children to keep it busy). The subsystem is
-responsible for responding to this. If the subsystem has references to
-the item in other threads, the memory is safe. It may take some time
-for the item to actually disappear from the subsystem's usage. But it
-is gone from configfs.
+Important:
+ drop_item() is void, and as such cannot fail. When rmdir(2)
+ is called, configfs WILL remove the item from the filesystem tree
+ (assuming that it has no children to keep it busy). The subsystem is
+ responsible for responding to this. If the subsystem has references to
+ the item in other threads, the memory is safe. It may take some time
+ for the item to actually disappear from the subsystem's usage. But it
+ is gone from configfs.
When drop_item() is called, the item's linkage has already been torn
down. It no longer has a reference on its parent and has no place in
@@ -319,10 +338,11 @@ is implemented in the configfs rmdir(2) code. ->drop_item() will not be
called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty.
-[struct configfs_subsystem]
+struct configfs_subsystem
+=========================
A subsystem must register itself, usually at module_init time. This
-tells configfs to make the subsystem appear in the file tree.
+tells configfs to make the subsystem appear in the file tree::
struct configfs_subsystem {
struct config_group su_group;
@@ -332,17 +352,19 @@ tells configfs to make the subsystem appear in the file tree.
int configfs_register_subsystem(struct configfs_subsystem *subsys);
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
- A subsystem consists of a toplevel config_group and a mutex.
+A subsystem consists of a toplevel config_group and a mutex.
The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have
initialized the mutex.
- When the register call returns, the subsystem is live, and it
+
+When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it.
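+
+A sketch of that sequence (the ``my_`` names are placeholders)::
+
+  static struct configfs_subsystem my_subsys = {
+          .su_group = {
+                  .cg_item = {
+                          .ci_namebuf = "mysubsys",
+                          .ci_type    = &my_subsys_type,
+                  },
+          },
+  };
+
+  static int __init my_subsys_init(void)
+  {
+          /* initialize the group and mutex before going live */
+          config_group_init(&my_subsys.su_group);
+          mutex_init(&my_subsys.su_mutex);
+          return configfs_register_subsystem(&my_subsys);
+  }
+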
-[An Example]
+An Example
+==========
The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in
@@ -350,7 +372,8 @@ samples/configfs/configfs_sample.c. It shows a trivial object displaying
and storing an attribute, and a simple group creating and destroying
these children.
-[Hierarchy Navigation and the Subsystem Mutex]
+Hierarchy Navigation and the Subsystem Mutex
+============================================
There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they
@@ -375,7 +398,8 @@ be in its parent's cg_children list for the same duration. This allows
a subsystem to trust ci_parent and cg_children while they hold the
mutex.
-[Item Aggregation Via symlink(2)]
+Item Aggregation Via symlink(2)
+===============================
configfs provides a simple group via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation
@@ -403,7 +427,8 @@ A config_item cannot be removed while it links to any other item, nor
can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs.
-[Automatically Created Subgroups]
+Automatically Created Subgroups
+===============================
A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much
@@ -433,7 +458,8 @@ As a consequence of this, default groups cannot be removed directly via
rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children.
-[Dependent Subsystems]
+Dependent Subsystems
+====================
Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that
@@ -459,50 +485,3 @@ up. Here, the heartbeat code calls configfs_depend_item(). If it
succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
-
-[Committable Items]
-
-NOTE: Committable items are currently unimplemented.
-
-Some config_items cannot have a valid initial state. That is, no
-default values can be specified for the item's attributes such that the
-item can do its work. Userspace must configure one or more attributes,
-after which the subsystem can start whatever entity this item
-represents.
-
-Consider the FakeNBD device from above. Without a target address *and*
-a target device, the subsystem has no idea what block device to import.
-The simple example assumes that the subsystem merely waits until all the
-appropriate attributes are configured, and then connects. This will,
-indeed, work, but now every attribute store must check if the attributes
-are initialized. Every attribute store must fire off the connection if
-that condition is met.
-
-Far better would be an explicit action notifying the subsystem that the
-config_item is ready to go. More importantly, an explicit action allows
-the subsystem to provide feedback as to whether the attributes are
-initialized in a way that makes sense. configfs provides this as
-committable items.
-
-configfs still uses only normal filesystem operations. An item is
-committed via rename(2). The item is moved from a directory where it
-can be modified to a directory where it cannot.
-
-Any group that provides the ct_group_ops->commit_item() method has
-committable items. When this group appears in configfs, mkdir(2) will
-not work directly in the group. Instead, the group will have two
-subdirectories: "live" and "pending". The "live" directory does not
-support mkdir(2) or rmdir(2) either. It only allows rename(2). The
-"pending" directory does allow mkdir(2) and rmdir(2). An item is
-created in the "pending" directory. Its attributes can be modified at
-will. Userspace commits the item by renaming it into the "live"
-directory. At this point, the subsystem receives the ->commit_item()
-callback. If all required attributes are filled to satisfaction, the
-method returns zero and the item is moved to the "live" directory.
-
-As rmdir(2) does not work in the "live" directory, an item must be
-shutdown, or "uncommitted". Again, this is done via rename(2), this
-time from the "live" directory back to the "pending" one. The subsystem
-is notified by the ct_group_ops->uncommit_object() method.
-
-
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
new file mode 100644
index 000000000000..719e90f1988e
--- /dev/null
+++ b/Documentation/filesystems/dax.rst
@@ -0,0 +1,307 @@
+=======================
+Direct Access for files
+=======================
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The `DAX` code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports `DAX`, you can make a filesystem
+on it as usual. The `DAX` code currently only supports files with a block
+size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block
+size when creating the filesystem.
+
+Currently 5 filesystems support `DAX`: ext2, ext4, xfs, virtiofs and erofs.
+The way to enable `DAX` differs for each of them.
+
+Enabling DAX on ext2 and erofs
+------------------------------
+
+When mounting the filesystem, use the ``-o dax`` option on the command line or
+add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files
+within the filesystem. It is equivalent to the ``-o dax=always`` behavior below.
+
+
+Enabling DAX on xfs and ext4
+----------------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to
+ the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details
+ about this access mode.
+
+ 2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular
+ files and directories. This advisory flag can be set or cleared at any
+ time, but doing so does not immediately affect the `S_DAX` state.
+
+ 3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will
+ be inherited by all regular files and subdirectories that are subsequently
+ created in this directory. Files and subdirectories that exist at the time
+ this flag is set or cleared on the parent directory are not affected by
+ the change.
+
+ 4. There exist dax mount options which can override `FS_XFLAG_DAX` in the
+ setting of the `S_DAX` flag. Given underlying storage which supports `DAX` the
+ following hold:
+
+ ``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default.
+
+ ``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`."
+
+ ``-o dax=always`` means "always set `S_DAX`, ignore `FS_XFLAG_DAX`."
+
+ ``-o dax`` is a legacy option which is an alias for ``dax=always``.
+
+ .. warning::
+
+ The option ``-o dax`` may be removed in the future so ``-o dax=always`` is
+ the preferred method for specifying this behavior.
+
+ .. note::
+
+ Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain
+ the same even when the filesystem is mounted with a dax option. However,
+ in-core inode state (`S_DAX`) will be overridden until the filesystem is
+ remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The `S_DAX` policy can be changed via:
+
+ a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are
+ created
+
+ b) Setting the appropriate dax="foo" mount option
+
+ c) Changing the `FS_XFLAG_DAX` flag on existing regular files and
+ directories. This has runtime constraints and limitations that are
+ described in 6) below.
+
+ 6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX`
+ flag, the change to existing regular files won't take effect until the
+ files are closed by all processes.
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`)
+and the other is a volatile flag indicating the active state of the feature
+(`S_DAX`).
+
+`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the `FS_IOC_FS`[`GS`]`ETXATTR` ioctl
+(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.
+
+New files and directories automatically inherit `FS_XFLAG_DAX` from
+their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+.. code-block:: shell
+
+ mkdir -p a/b/c
+ xfs_io -c 'chattr +x' a
+ mkdir a/b/c/d
+ mkdir a/e
+
+ ------[outcome]------
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+.. code-block:: shell
+
+ mkdir a
+ xfs_io -c 'chattr +x' a
+ mkdir -p a/b/c/d
+
+ ------[outcome]------
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+.. code-block:: shell
+
+ mkdir -p a/b/c
+ xfs_io -c 'chattr +x' c
+ mkdir a/b/c/d
+
+ ------[outcome]------
+
+ dax: c,d
+ no dax: a,b
+
+The current enabled state (`S_DAX`) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of `FS_XFLAG_DAX` and the filesystem's dax mount option.
+
+statx can be used to query `S_DAX`.
+
+.. note::
+
+ Only regular files will ever have `S_DAX` set; therefore statx
+ will never indicate that `S_DAX` is set on directories.
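+
+For example (assuming a C library recent enough to provide the statx(2)
+wrapper and `STATX_ATTR_DAX`):
+
+.. code-block:: c
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <sys/stat.h>
+
+  /* returns 1 if S_DAX is active, 0 if not, -1 on error */
+  static int file_is_dax(const char *path)
+  {
+          struct statx stx;
+
+          if (statx(AT_FDCWD, path, 0, 0, &stx) != 0)
+                  return -1;
+          return !!(stx.stx_attributes & STATX_ATTR_DAX);
+  }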
+
+Setting the `FS_XFLAG_DAX` flag (explicitly or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+
+
+Enabling DAX on virtiofs
+------------------------
+
+The semantics of DAX on virtiofs are basically the same as on ext4 and xfs,
+except that when '-o dax=inode' is specified, the virtiofs client derives the
+hint on whether DAX shall be enabled from the virtiofs server through the FUSE
+protocol, rather than from the persistent `FS_XFLAG_DAX` flag. That is, whether
+DAX shall be enabled is completely determined by the virtiofs server, and the
+server may employ various algorithms to make this decision, e.g. depending
+on the persistent `FS_XFLAG_DAX` flag on the host.
+
+Setting or clearing the persistent `FS_XFLAG_DAX` flag inside the guest is
+still supported, but it does not guarantee that DAX will then be enabled or
+disabled for the corresponding file. Users inside the guest still need to call
+statx(2) and check the statx flag `STATX_ATTR_DAX` to see if DAX is enabled for
+this file.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support `DAX` in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
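+
+A rough sketch only; the in-tree prototype has changed across kernel
+versions, so treat this as an illustration of the contract described above
+rather than the current signature (see drivers/nvdimm/pmem.c for a real
+implementation):
+
+.. code-block:: c
+
+  static long my_direct_access(struct block_device *bdev, sector_t sector,
+                               void **kaddr, pfn_t *pfn, long size)
+  {
+          struct my_dev *dev = bdev->bd_disk->private_data;
+          resource_size_t offset = (resource_size_t)sector << 9;
+
+          if (offset >= dev->size)
+                  return -ERANGE;
+
+          *kaddr = dev->virt_addr + offset;
+          *pfn = phys_to_pfn_t(dev->phys_addr + offset, PFN_DEV);
+
+          /* bytes contiguously accessible from this offset */
+          return dev->size - offset;
+  }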
+
+These block devices may be used for inspiration:
+
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+- pmem: NVDIMM persistent memory driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of:
+
+* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in
+ i_flags
+* Implementing ->read_iter and ->write_iter operations which use
+ :c:func:`dax_iomap_rw()` when inode has `S_DAX` flag set
+* Implementing an mmap file operation for `DAX` files which sets the
+  `VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to
+  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
+  handlers should probably call :c:func:`dax_iomap_fault()` passing the
+  appropriate fault size and iomap operations (see the sketch after this list).
+* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations
+ instead of :c:func:`block_truncate_page()` for `DAX` files
+* Ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
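+
+A sketch of the mmap piece referenced in the list above (``my_`` names are
+placeholders; the fault handlers would in turn call
+:c:func:`dax_iomap_fault()`, and newer kernels update VMA flags via
+vm_flags_set() rather than direct assignment):
+
+.. code-block:: c
+
+  static const struct vm_operations_struct my_dax_vm_ops = {
+          .fault          = my_dax_fault,
+          .huge_fault     = my_dax_huge_fault,
+          .page_mkwrite   = my_dax_fault,
+          .pfn_mkwrite    = my_dax_fault,
+  };
+
+  static int my_file_mmap(struct file *file, struct vm_area_struct *vma)
+  {
+          if (!IS_DAX(file_inode(file)))
+                  return -EOPNOTSUPP;
+
+          file_accessed(file);
+          vma->vm_ops = &my_dax_vm_ops;
+          vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+          return 0;
+  }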
+
+The iomap handlers for allocating blocks must make sure that allocated blocks
+are zeroed out and converted to written extents before being returned to avoid
+exposure of uninitialized data through mmap.
+
+These filesystems may be used for inspiration:
+
+.. seealso::
+
+   - ext2: Documentation/filesystems/ext2.rst
+   - xfs: Documentation/admin-guide/xfs.rst
+   - ext4: Documentation/filesystems/ext4/
+
+
+Handling Media Errors
+---------------------
+
+The libnvdimm subsystem stores a record of known media error locations for
+each pmem block device (in gendisk->badblocks). If we fault at such location,
+or one with a latent error not yet discovered, the application can expect
+to receive a `SIGBUS`. Libnvdimm also allows clearing of these errors by simply
+writing the affected sectors (through the pmem driver, and if the underlying
+NVDIMM supports the clear_poison DSM defined by ACPI).
+
+Since `DAX` IO normally doesn't go through the driver/bio path, applications or
+sysadmins have an option to restore the lost data from a prior backup or from
+inbuilt redundancy in the following ways:
+
+1. Delete the affected file, and restore from a backup (sysadmin route):
+ This will free the filesystem blocks that were being used by the file,
+ and the next time they're allocated, they will be zeroed first, which
+ happens through the driver, and will clear bad sectors.
+
+2. Truncate or hole-punch the part of the file that has a bad-block (at least
+ an entire aligned sector has to be hole-punched, but not necessarily an
+ entire filesystem block).
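+
+Option 2 can be exercised from userspace with fallocate(2); a sketch,
+assuming 4096 bytes is a suitably aligned range on this filesystem:
+
+.. code-block:: c
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+
+  static int punch_bad_block(int fd, off_t off)
+  {
+          /* punch one 4096-byte range without changing the file size */
+          return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           off, 4096);
+  }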
+
+These are the two basic paths that allow `DAX` filesystems to continue operating
+in the presence of media errors. More robust error recovery mechanisms can be
+built on top of this in the future, for example, involving redundancy/mirroring
+provided at the block layer through DM, or additionally, at the filesystem
+level. These would have to rely on the above two tenets, that error clearing
+can happen either by sending an IO through the driver, or zeroing (also through
+the driver).
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+`DAX` on a block device that supports `DAX`, they will still be copied into RAM.
+
+The DAX code does not work correctly on architectures which have virtually
+mapped caches such as ARM, MIPS and SPARC.
+
+Calling :c:func:`get_user_pages()` on a range of user memory that has been
+mmapped from a `DAX` file will fail when there are no 'struct page' to describe
+those pages. This problem has been addressed in some device drivers
+by adding optional struct page support for pages under the control of
+the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
+how to do this). In the non struct page cases `O_DIRECT` reads/writes to
+those memory ranges from a non-`DAX` file will fail.
+
+
+.. note::
+
+ `O_DIRECT` reads/writes *of a DAX file* do work; it is the memory that
+ is being accessed that is key here. Other things that will not work in
+ the non struct page case include RDMA, :c:func:`sendfile()` and
+ :c:func:`splice()`.
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
deleted file mode 100644
index 679729442fd2..000000000000
--- a/Documentation/filesystems/dax.txt
+++ /dev/null
@@ -1,132 +0,0 @@
-Direct Access for files
------------------------
-
-Motivation
-----------
-
-The page cache is usually used to buffer reads and writes to files.
-It is also used to provide the pages which are mapped into userspace
-by a call to mmap.
-
-For block devices that are memory-like, the page cache pages would be
-unnecessary copies of the original storage. The DAX code removes the
-extra copy by performing reads and writes directly to the storage device.
-For file mappings, the storage device is mapped directly into userspace.
-
-
-Usage
------
-
-If you have a block device which supports DAX, you can make a filesystem
-on it as usual. The DAX code currently only supports files with a block
-size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
-
-
-Implementation Tips for Block Driver Writers
---------------------------------------------
-
-To support DAX in your block driver, implement the 'direct_access'
-block device operation. It is used to translate the sector number
-(expressed in units of 512-byte sectors) to a page frame number (pfn)
-that identifies the physical page for the memory. It also returns a
-kernel virtual address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-
-In order to support this method, the storage must be byte-accessible by
-the CPU at all times. If your device uses paging techniques to expose
-a large amount of memory through a smaller window, then you cannot
-implement direct_access. Equally, if your device can occasionally
-stall the CPU for an extended period, you should also not attempt to
-implement direct_access.
-
-These block devices may be used for inspiration:
-- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
-- pmem: NVDIMM persistent memory driver
-
-
-Implementation Tips for Filesystem Writers
-------------------------------------------
-
-Filesystem support consists of
-- adding support to mark inodes as being DAX by setting the S_DAX flag in
- i_flags
-- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
- when inode has S_DAX flag set
-- implementing an mmap file operation for DAX files which sets the
- VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
- include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
- handlers should probably call dax_iomap_fault() passing the appropriate
- fault size and iomap operations.
-- calling iomap_zero_range() passing appropriate iomap operations instead of
- block_truncate_page() for DAX files
-- ensuring that there is sufficient locking between reads, writes,
- truncates and page faults
-
-The iomap handlers for allocating blocks must make sure that allocated blocks
-are zeroed out and converted to written extents before being returned to avoid
-exposure of uninitialized data through mmap.
-
-These filesystems may be used for inspiration:
-- ext2: see Documentation/filesystems/ext2.txt
-- ext4: see Documentation/filesystems/ext4/
-- xfs: see Documentation/admin-guide/xfs.rst
-
-
-Handling Media Errors
----------------------
-
-The libnvdimm subsystem stores a record of known media error locations for
-each pmem block device (in gendisk->badblocks). If we fault at such location,
-or one with a latent error not yet discovered, the application can expect
-to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
-writing the affected sectors (through the pmem driver, and if the underlying
-NVDIMM supports the clear_poison DSM defined by ACPI).
-
-Since DAX IO normally doesn't go through the driver/bio path, applications or
-sysadmins have an option to restore the lost data from a prior backup/inbuilt
-redundancy in the following ways:
-
-1. Delete the affected file, and restore from a backup (sysadmin route):
- This will free the file system blocks that were being used by the file,
- and the next time they're allocated, they will be zeroed first, which
- happens through the driver, and will clear bad sectors.
-
-2. Truncate or hole-punch the part of the file that has a bad-block (at least
- an entire aligned sector has to be hole-punched, but not necessarily an
- entire filesystem block).
-
-These are the two basic paths that allow DAX filesystems to continue operating
-in the presence of media errors. More robust error recovery mechanisms can be
-built on top of this in the future, for example, involving redundancy/mirroring
-provided at the block layer through DM, or additionally, at the filesystem
-level. These would have to rely on the above two tenets, that error clearing
-can happen either by sending an IO through the driver, or zeroing (also through
-the driver).
-
-
-Shortcomings
-------------
-
-Even if the kernel or its modules are stored on a filesystem that supports
-DAX on a block device that supports DAX, they will still be copied into RAM.
-
-The DAX code does not work correctly on architectures which have virtually
-mapped caches such as ARM, MIPS and SPARC.
-
-Calling get_user_pages() on a range of user memory that has been mmaped
-from a DAX file will fail when there are no 'struct page' to describe
-those pages. This problem has been addressed in some device drivers
-by adding optional struct page support for pages under the control of
-the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
-how to do this). In the non struct page cases O_DIRECT reads/writes to
-those memory ranges from a non-DAX file will fail (note that O_DIRECT
-reads/writes _of a DAX file_ do work, it is the memory that is being
-accessed that is key here). Other things that will not work in the
-non struct page case include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index 6c032db235a5..dc35da8b8792 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -120,8 +120,8 @@ and hexadecimal::
Boolean values can be placed in debugfs with::
- struct dentry *debugfs_create_bool(const char *name, umode_t mode,
- struct dentry *parent, bool *value);
+ void debugfs_create_bool(const char *name, umode_t mode,
+ struct dentry *parent, bool *value);
A read on the resulting file will yield either Y (for non-zero values) or
N, followed by a newline. If written to, it will accept either upper- or
@@ -155,8 +155,8 @@ any code which does so in the mainline. Note that all files created with
debugfs_create_blob() are read-only.
If you want to dump a block of registers (something that happens quite
-often during development, even if little such code reaches mainline.
-Debugfs offers two functions: one to make a registers-only file, and
+often during development, even if little such code reaches mainline),
+debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential
file::
@@ -166,35 +166,40 @@ file::
};
struct debugfs_regset32 {
- struct debugfs_reg32 *regs;
+ const struct debugfs_reg32 *regs;
int nregs;
void __iomem *base;
+ struct device *dev; /* Optional device for Runtime PM */
};
debugfs_create_regset32(const char *name, umode_t mode,
struct dentry *parent,
struct debugfs_regset32 *regset);
- void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs,
+ void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs,
int nregs, void __iomem *base, char *prefix);
The "base" argument may be 0, but you may want to build the reg32 array
using __stringify, and a number of register names (macros) are actually
byte offsets over a base for the register block.
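+
+For example (``CTRL`` and ``STATUS`` stand in for hypothetical
+register-offset macros)::
+
+    #define MY_REG(r) { .name = __stringify(r), .offset = r }
+
+    static const struct debugfs_reg32 my_regs[] = {
+            MY_REG(CTRL),           /* expands to { "CTRL", CTRL } */
+            MY_REG(STATUS),
+    };
+
+    static struct debugfs_regset32 my_regset = {
+            .regs  = my_regs,
+            .nregs = ARRAY_SIZE(my_regs),
+    };
+
+    /* once my_regset.base points at the ioremap()ed block: */
+    debugfs_create_regset32("registers", 0444, parent, &my_regset);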
-If you want to dump an u32 array in debugfs, you can create file with::
+If you want to dump a u32 array in debugfs, you can create a file with::
+
+ struct debugfs_u32_array {
+ u32 *array;
+ u32 n_elements;
+ };
void debugfs_create_u32_array(const char *name, umode_t mode,
struct dentry *parent,
- u32 *array, u32 elements);
+ struct debugfs_u32_array *array);
-The "array" argument provides data, and the "elements" argument is
-the number of elements in the array. Note: Once array is created its
-size can not be changed.
+The "array" argument wraps a pointer to the array's data and the number
+of its elements. Note: Once array is created its size can not be changed.
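+
+A usage sketch (the wrapper and the wrapped array must stay allocated for
+as long as the file exists)::
+
+    static u32 my_values[4];
+
+    static struct debugfs_u32_array my_array = {
+            .array      = my_values,
+            .n_elements = ARRAY_SIZE(my_values),
+    };
+
+    debugfs_create_u32_array("values", 0444, parent, &my_array);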
-There is a helper function to create device related seq_file::
+There is a helper function to create a device-related seq_file::
- struct dentry *debugfs_create_devm_seqfile(struct device *dev,
+ void debugfs_create_devm_seqfile(struct device *dev,
const char *name,
struct dentry *parent,
int (*read_fn)(struct seq_file *s,
diff --git a/Documentation/filesystems/devpts.rst b/Documentation/filesystems/devpts.rst
new file mode 100644
index 000000000000..b6324ab1960d
--- /dev/null
+++ b/Documentation/filesystems/devpts.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+The Devpts Filesystem
+=====================
+
+Each mount of the devpts filesystem is now distinct such that ptys
+and their indices allocated in one mount are independent from ptys
+and their indices in all other mounts.
+
+All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
+with permissions ``0000``.
+
+To retain backwards compatibility, a ptmx device node (i.e. any node
+created with ``mknod name c 5 2``), when opened, will look for an instance
+of devpts under the name ``pts`` in the same directory as the ptmx device
+node.
+
+As an option, instead of placing a ptmx device node at ``/dev/ptmx``,
+it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or
+to bind mount ``/dev/pts/ptmx`` to ``/dev/ptmx``. If you opt for using
+the devpts filesystem in this manner, devpts should be mounted with
+the ``ptmxmode=0666`` option, or ``chmod 0666 /dev/pts/ptmx`` should be called.
+
+Total count of pty pairs in all instances is limited by sysctls::
+
+ kernel.pty.max = 4096 - global limit
+ kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
+ kernel.pty.nr - current count of ptys
+
+A per-instance limit can be set by adding the mount option ``max=<count>``.
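+
+For instance, a private instance limited to 16 ptys could be mounted like
+this (a sketch using mount(2); the mount point is hypothetical)::
+
+    #include <sys/mount.h>
+
+    static int mount_private_pts(void)
+    {
+            /* at most 16 ptys in this instance, world-usable ptmx */
+            return mount("devpts", "/mnt/pts", "devpts", 0,
+                         "ptmxmode=0666,max=16");
+    }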
+
+This feature was added in kernel 3.4 together with
+``sysctl kernel.pty.reserve``.
+
+In kernels older than 3.4, the sysctl ``kernel.pty.max`` works as a
+per-instance limit.
diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt
deleted file mode 100644
index 9f94fe276dea..000000000000
--- a/Documentation/filesystems/devpts.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Each mount of the devpts filesystem is now distinct such that ptys
-and their indicies allocated in one mount are independent from ptys
-and their indicies in all other mounts.
-
-All mounts of the devpts filesystem now create a /dev/pts/ptmx node
-with permissions 0000.
-
-To retain backwards compatibility the a ptmx device node (aka any node
-created with "mknod name c 5 2") when opened will look for an instance
-of devpts under the name "pts" in the same directory as the ptmx device
-node.
-
-As an option instead of placing a /dev/ptmx device node at /dev/ptmx
-it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or
-to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using
-the devpts filesystem in this manner devpts should be mounted with
-the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.
-
-Total count of pty pairs in all instances is limited by sysctls:
-kernel.pty.max = 4096 - global limit
-kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
-kernel.pty.nr - current count of ptys
-
-Per-instance limit could be set by adding mount option "max=<count>".
-This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve.
-In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index de12016ee419..05ea387bc9fb 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -11,127 +11,268 @@ When taking the i_rwsem on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following.
-For our purposes all operations fall in 5 classes:
-1) read access. Locking rules: caller locks directory we are accessing.
-The lock is taken shared.
+Primitives
+==========
-2) object creation. Locking rules: same as above, but the lock is taken
-exclusive.
+For our purposes all operations fall in 6 classes:
-3) object removal. Locking rules: caller locks parent, finds victim,
-locks victim and calls the method. Locks are exclusive.
+1. read access. Locking rules:
-4) rename() that is _not_ cross-directory. Locking rules: caller locks
-the parent and finds source and target. In case of exchange (with
-RENAME_EXCHANGE in flags argument) lock both. In any case,
-if the target already exists, lock it. If the source is a non-directory,
-lock it. If we need to lock both, lock them in inode pointer order.
-Then call the method. All locks are exclusive.
-NB: we might get away with locking the the source (and target in exchange
-case) shared.
+ * lock the directory we are accessing (shared)
-5) link creation. Locking rules:
+2. object creation. Locking rules:
- * lock parent
- * check that source is not a directory
- * lock source
- * call the method.
+ * lock the directory we are accessing (exclusive)
-All locks are exclusive.
+3. object removal. Locking rules:
-6) cross-directory rename. The trickiest in the whole bunch. Locking
-rules:
+ * lock the parent (exclusive)
+ * find the victim
+ * lock the victim (exclusive)
- * lock the filesystem
- * lock parents in "ancestors first" order.
- * find source and target.
- * if old parent is equal to or is a descendent of target
- fail with -ENOTEMPTY
- * if new parent is equal to or is a descendent of source
- fail with -ELOOP
- * If it's an exchange, lock both the source and the target.
- * If the target exists, lock it. If the source is a non-directory,
- lock it. If we need to lock both, do so in inode pointer order.
- * call the method.
+4. link creation. Locking rules:
+
+ * lock the parent (exclusive)
+ * check that the source is not a directory
+ * lock the source (exclusive; probably could be weakened to shared)
-All ->i_rwsem are taken exclusive. Again, we might get away with locking
-the the source (and target in exchange case) shared.
+5. rename that is _not_ cross-directory. Locking rules:
-The rules above obviously guarantee that all directories that are going to be
-read, modified or removed by method will be locked by caller.
+ * lock the parent (exclusive)
+ * find the source and target
+ * decide which of the source and target need to be locked.
+ The source needs to be locked if it's a non-directory, target - if it's
+ a non-directory or about to be removed.
+ * take the locks that need to be taken (exclusive), in inode pointer order
+ if need to take both (that can happen only when both source and target
+ are non-directories - the source because it wouldn't need to be locked
+ otherwise and the target because mixing directory and non-directory is
+ allowed only with RENAME_EXCHANGE, and that won't be removing the target).
+6. cross-directory rename. The trickiest in the whole bunch. Locking rules:
+
+ * lock the filesystem
+ * if the parents don't have a common ancestor, fail the operation.
+ * lock the parents in "ancestors first" order (exclusive). If neither is an
+ ancestor of the other, lock the parent of source first.
+ * find the source and target.
+ * verify that the source is not a descendent of the target and
+ target is not a descendent of source; fail the operation otherwise.
+ * lock the subdirectories involved (exclusive), source before target.
+ * lock the non-directories involved (exclusive), in inode pointer order.
+
+The rules above obviously guarantee that all directories that are going
+to be read, modified or removed by method will be locked by the caller.
+
+
+Splicing
+========
+
+There is one more thing to consider - splicing. It's not an operation
+in its own right; it may happen as part of lookup. We speak of the
+operations on directory trees, but we obviously do not have the full
+picture of those - especially for network filesystems. What we have
+is a bunch of subtrees visible in dcache and locking happens on those.
+Trees grow as we do operations; memory pressure prunes them. Normally
+that's not a problem, but there is a nasty twist - what should we do
+when one growing tree reaches the root of another? That can happen in
+several scenarios, starting from "somebody mounted two nested subtrees
+from the same NFS4 server and doing lookups in one of them has reached
+the root of another"; there's also open-by-fhandle stuff, and there's a
+possibility that a directory we see in one place gets moved by the server
+to another and we run into it when we do a lookup.
+
+For a lot of reasons we want to have the same directory present in dcache
+only once. Multiple aliases are not allowed. So when lookup runs into
+a subdirectory that already has an alias, something needs to be done with
+dcache trees. Lookup is already holding the parent locked. If the alias
+is a root of a separate tree, it gets attached to the directory we are doing a
+lookup in, under the name we'd been looking for. If the alias is already
+a child of the directory we are looking in, it changes name to the one
+we'd been looking for. No extra locking is involved in these two cases.
+However, if it's a child of some other directory, things get trickier.
+First of all, we verify that it is *not* an ancestor of our directory
+and fail the lookup if it is. Then we try to lock the filesystem and the
+current parent of the alias. If either trylock fails, we fail the lookup.
+If trylocks succeed, we detach the alias from its current parent and
+attach to our directory, under the name we are looking for.
+
+Note that splicing does *not* involve any modification of the filesystem;
+all we change is the view in dcache. Moreover, holding a directory locked
+exclusive prevents such changes involving its children and holding the
+filesystem lock prevents any changes of tree topology, other than having a
+root of one tree becoming a child of directory in another. In particular,
+if two dentries have been found to have a common ancestor after taking
+the filesystem lock, their relationship will remain unchanged until
+the lock is dropped. So from the directory operations' point of view
+splicing is almost irrelevant - the only place where it matters is one
+step in cross-directory renames; we need to be careful when checking if
+parents have a common ancestor.
+
+
+Multiple-filesystem stuff
+=========================
+
+For some filesystems a method can involve a directory operation on
+another filesystem; it may be ecryptfs doing an operation in the underlying
+filesystem, overlayfs doing something to the layers, network filesystem
+using a local one as a cache, etc. In all such cases the operations
+on other filesystems must follow the same locking rules. Moreover, "a
+directory operation on this filesystem might involve directory operations
+on that filesystem" should be an asymmetric relation (or, if you will,
+it should be possible to rank the filesystems so that directory operation
+on a filesystem could trigger directory operations only on higher-ranked
+ones - in these terms overlayfs ranks lower than its layers, network
+filesystem ranks lower than whatever it caches on, etc.)
+
+
+Deadlock avoidance
+==================
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
- First of all, at any moment we have a partial ordering of the
- objects - A < B iff A is an ancestor of B.
-
- That ordering can change. However, the following is true:
-
-(1) if object removal or non-cross-directory rename holds lock on A and
- attempts to acquire lock on B, A will remain the parent of B until we
- acquire the lock on B. (Proof: only cross-directory rename can change
- the parent of object and it would have to lock the parent).
-
-(2) if cross-directory rename holds the lock on filesystem, order will not
- change until rename acquires all locks. (Proof: other cross-directory
- renames will be blocked on filesystem lock and we don't start changing
- the order until we had acquired all locks).
-
-(3) locks on non-directory objects are acquired only after locks on
- directory objects, and are acquired in inode pointer order.
- (Proof: all operations but renames take lock on at most one
- non-directory object, except renames, which take locks on source and
- target in inode pointer order in the case they are not directories.)
-
-Now consider the minimal deadlock. Each process is blocked on
-attempt to acquire some lock and already holds at least one lock. Let's
-consider the set of contended locks. First of all, filesystem lock is
-not contended, since any process blocked on it is not holding any locks.
-Thus all processes are blocked on ->i_rwsem.
-
-By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address. Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
-
-Thus link creation can't be a part of deadlock - it can't be
-blocked on source and it means that it doesn't hold any locks.
-
-Any contended object is either held by cross-directory rename or
-has a child that is also contended. Indeed, suppose that it is held by
-operation other than cross-directory rename. Then the lock this operation
-is blocked on belongs to child of that object due to (1).
-
-It means that one of the operations is cross-directory rename.
-Otherwise the set of contended objects would be infinite - each of them
-would have a contended child and we had assumed that no object is its
-own descendent. Moreover, there is exactly one cross-directory rename
-(see above).
-
-Consider the object blocking the cross-directory rename. One
-of its descendents is locked by cross-directory rename (otherwise we
-would again have an infinite set of contended objects). But that
-means that cross-directory rename is taking locks out of order. Due
-to (2) the order hadn't changed since we had acquired filesystem lock.
-But locking rules for cross-directory rename guarantee that we do not
-try to acquire lock on descendent before the lock on ancestor.
-Contradiction. I.e. deadlock is impossible. Q.E.D.
-
+There is a ranking on the locks, such that all primitives take
+them in order of non-decreasing rank. Namely,
+
+ * rank ->i_rwsem of non-directories on given filesystem in inode pointer
+ order.
+ * put ->i_rwsem of all directories on a filesystem at the same rank,
+ lower than ->i_rwsem of any non-directory on the same filesystem.
+ * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem
+ on the same filesystem.
+ * among the locks on different filesystems use the relative
+ rank of those filesystems.
+
+For example, if we have NFS filesystem caching on a local one, we have
+
+ 1. ->s_vfs_rename_mutex of NFS filesystem
+ 2. ->i_rwsem of directories on that NFS filesystem, same rank for all
+ 3. ->i_rwsem of non-directories on that filesystem, in order of
+ increasing address of inode
+ 4. ->s_vfs_rename_mutex of local filesystem
+ 5. ->i_rwsem of directories on the local filesystem, same rank for all
+ 6. ->i_rwsem of non-directories on local filesystem, in order of
+ increasing address of inode.
+
+It's easy to verify that operations never take a lock with rank
+lower than that of an already held lock.
+
+Suppose deadlocks are possible. Consider the minimal deadlocked
+set of threads. It is a cycle of several threads, each blocked on a lock
+held by the next thread in the cycle.
+
+Since the locking order is consistent with the ranking, all
+contended locks in the minimal deadlock will be of the same rank,
+i.e. they all will be ->i_rwsem of directories on the same filesystem.
+Moreover, without loss of generality we can assume that all operations
+are done directly to that filesystem and none of them has actually
+reached the method call.
+
+In other words, we have a cycle of threads, T1,..., Tn,
+and the same number of directories (D1,...,Dn) such that
+
+ T1 is blocked on D1 which is held by T2
+
+ T2 is blocked on D2 which is held by T3
+
+ ...
+
+ Tn is blocked on Dn which is held by T1.
+
+Each operation in the minimal cycle must have locked at least
+one directory and blocked on attempt to lock another. That leaves
+only 3 possible operations: directory removal (locks parent, then
+child), same-directory rename killing a subdirectory (ditto) and
+cross-directory rename of some sort.
+
+There must be a cross-directory rename in the set; indeed,
+if all operations had been of the "lock parent, then child" sort
+we would have Dn a parent of D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships couldn't
+have changed since the moment directory locks had been acquired,
+so they would all hold simultaneously at the deadlock time and
+we would have a loop.
+
+Since all operations are on the same filesystem, there can't be
+more than one cross-directory rename among them. Without loss of
+generality we can assume that T1 is the one doing a cross-directory
+rename and everything else is of the "lock parent, then child" sort.
+
+In other words, we have a cross-directory rename that locked
+Dn and blocked on attempt to lock D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships between
+D1,...,Dn all hold simultaneously at the deadlock time. Moreover,
+cross-directory rename does not get to locking any directories until it
+has acquired filesystem lock and verified that directories involved have
+a common ancestor, which guarantees that ancestry relationships between
+all of them had been stable.
+
+Consider the order in which directories are locked by the
+cross-directory rename; parents first, then possibly their children.
+Dn and D1 would have to be among those, with Dn locked before D1.
+Which pair could it be?
+
+It can't be the parents - indeed, since D1 is an ancestor of Dn,
+it would be the first parent to be locked. Therefore at least one of the
+children must be involved and thus neither of them could be a descendent
+of another - otherwise the operation would not have progressed past
+locking the parents.
+
+It can't be a parent and its child; otherwise we would've had
+a loop, since the parents are locked before the children, so the parent
+would have to be a descendent of its child.
+
+It can't be a parent and a child of another parent either.
+Otherwise the child of the parent in question would've been a descendent
+of another child.
+
+That leaves only one possibility - namely, both Dn and D1 are
+among the children, in some order. But that is also impossible, since
+neither of the children is a descendent of another.
+
+That concludes the proof, since the set of operations with the
+properties required for a minimal deadlock cannot exist.
+
+Note that the check for having a common ancestor in cross-directory
+rename is crucial - without it a deadlock would be possible. Indeed,
+suppose the parents are initially in different trees; we would lock the
+parent of source, then try to lock the parent of target, only to have
+an unrelated lookup splice a distant ancestor of source to some distant
+descendent of the parent of target. At that point we have cross-directory
+rename holding the lock on parent of source and trying to lock its
+distant ancestor. Add a bunch of rmdir() attempts on all directories
+in between (all of those would fail with -ENOTEMPTY, had they ever gotten
+the locks) and voila - we have a deadlock.
+
+Loop avoidance
+==============
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
-Since the only new (parent, child) pair added by rename() is (new parent,
-source), such loop would have to contain these objects and the rest of it
-would have to exist before rename(). I.e. at the moment of loop creation
-rename() responsible for that would be holding filesystem lock and new parent
-would have to be equal to or a descendent of source. But that means that
-new parent had been equal to or a descendent of source since the moment when
-we had acquired filesystem lock and rename() would fail with -ELOOP in that
-case.
+Suppose after the operation there is a loop; since there hadn't been such
+loops before the operation, at least one of the nodes in that loop must've
+had its parent changed. In other words, the loop must be passing through
+the source or, in case of exchange, possibly the target.
+
+Since the operation has succeeded, neither source nor target could have
+been ancestors of each other. Therefore the chain of ancestors starting
+in the parent of source could not have passed through the target and
+vice versa. On the other hand, the chain of ancestors of any node could
+not have passed through the node itself, or we would've had a loop before
+the operation. But everything other than source and target has kept
+the parent after the operation, so the operation does not change the
+chains of ancestors of (ex-)parents of source and target. In particular,
+those chains must end after a finite number of steps.
+
+Now consider the loop created by the operation. It passes through either
+source or target; the next node in the loop would be the ex-parent of
+target or source resp. After that the loop would follow the chain of
+ancestors of that parent. But as we have just shown, that chain must
+end after a finite number of steps, which means that it can't be a part
+of any loop. Q.E.D.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst
index 68daaa7facf9..7e2b1fd471d7 100644
--- a/Documentation/filesystems/dlmfs.rst
+++ b/Documentation/filesystems/dlmfs.rst
@@ -12,7 +12,7 @@ dlmfs is built with OCFS2 as it requires most of its infrastructure.
:Project web page: http://ocfs2.wiki.kernel.org
:Tools web page: https://github.com/markfasheh/ocfs2-tools
-:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+:OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html
All code copyright 2005 Oracle except when otherwise noted.
diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.rst
index 15156883d321..a28a1f9ef79c 100644
--- a/Documentation/filesystems/dnotify.txt
+++ b/Documentation/filesystems/dnotify.rst
@@ -1,5 +1,8 @@
- Linux Directory Notification
- ============================
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Linux Directory Notification
+============================
Stephen Rothwell <sfr@canb.auug.org.au>
@@ -12,6 +15,7 @@ being delivered using signals.
The application decides which "events" it wants to be notified about.
The currently defined events are:
+ ========= =====================================================
DN_ACCESS A file in the directory was accessed (read)
DN_MODIFY A file in the directory was modified (write,truncate)
DN_CREATE A file was created in the directory
@@ -19,6 +23,7 @@ The currently defined events are:
DN_RENAME A file in the directory was renamed
DN_ATTRIB A file in the directory had its attributes
changed (chmod,chown)
+ ========= =====================================================
Usually, the application must reregister after each notification, but
if DN_MULTISHOT is or'ed with the event mask, then the registration will
@@ -36,7 +41,7 @@ especially important if DN_MULTISHOT is specified. Note that SIGRTMIN
is often blocked, so it is better to use (at least) SIGRTMIN + 1.
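+
+Putting this together, registration might look like the following sketch
+(see the selftest referenced below for a complete program)::
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <signal.h>
+    #include <unistd.h>
+
+    static int watch_dir(const char *path)
+    {
+            int fd = open(path, O_RDONLY);
+
+            if (fd < 0)
+                    return -1;
+            /* deliver events on SIGRTMIN + 1 rather than the default SIGIO */
+            if (fcntl(fd, F_SETSIG, SIGRTMIN + 1) < 0 ||
+                fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_MULTISHOT) < 0) {
+                    close(fd);
+                    return -1;
+            }
+            return fd;
+    }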
Implementation expectations (features and bugs :-))
----------------------------
+---------------------------------------------------
The notification should work for any local access to files even if the
actual file system is on a remote server. This implies that remote
@@ -67,4 +72,4 @@ See tools/testing/selftests/filesystems/dnotify_test.c for an example.
NOTE
----
Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
-See Documentation/filesystems/inotify.txt for more information on it.
+See Documentation/filesystems/inotify.rst for more information on it.
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 90ac65683e7e..0551985821b8 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -24,3 +24,20 @@ files that are not well-known standardized variables are created
as immutable files. This doesn't prevent removal - "chattr -i" will work -
but it does prevent this kind of failure from being accomplished
accidentally.
+
+.. warning::
+ When the content of a UEFI variable in /sys/firmware/efi/efivars is
+ displayed, for example using "hexdump", note that the first
+ 4 bytes of the output represent the UEFI variable attributes,
+ in little-endian format.
+
+ In practice, the output of each efivar is composed of:
+
+ +-----------------------------------+
+ |4_bytes_of_attributes + efivar_data|
+ +-----------------------------------+
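+
+A sketch of reading a variable's attribute word accordingly (the caller
+passes the path of an existing efivar file; a little-endian host is
+assumed for the direct read into a u32)::
+
+    #include <fcntl.h>
+    #include <stdint.h>
+    #include <unistd.h>
+
+    static int read_efivar_attrs(const char *path, uint32_t *attrs)
+    {
+            int fd = open(path, O_RDONLY);
+            ssize_t n;
+
+            if (fd < 0)
+                    return -1;
+            /* the first 4 bytes hold the little-endian attribute word */
+            n = read(fd, attrs, sizeof(*attrs));
+            close(fd);
+            return n == sizeof(*attrs) ? 0 : -1;
+    }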
+
+*See also:*
+
+- Documentation/admin-guide/acpi/ssdt-overlays.rst
+- Documentation/ABI/stable/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index bf145171c2bf..cc4626d6ee4f 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -1,63 +1,100 @@
.. SPDX-License-Identifier: GPL-2.0
======================================
-Enhanced Read-Only File System - EROFS
+EROFS - Enhanced Read-Only File System
======================================
Overview
========
-EROFS file-system stands for Enhanced Read-Only File System. Different
-from other read-only file systems, it aims to be designed for flexibility,
-scalability, but be kept simple and high performance.
+EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
+generic read-only filesystem solution for various read-only use cases instead
+of just focusing on storage space saving while ignoring the side effects on
+runtime performance.
-It is designed as a better filesystem solution for the following scenarios:
+It is designed to meet needs such as flexibility, feature extensibility and
+user-payload friendliness. Apart from those, it is still kept as a simple,
+random-access friendly, high-performance filesystem to avoid the unneeded I/O
+amplification and memory-resident overhead of similar approaches.
+
+It is implemented to be a better choice for the following scenarios:
- read-only storage media or
- part of a fully trusted read-only solution, which means it needs to be
immutable and bit-for-bit identical to the official golden image for
- their releases due to security and other considerations and
+ their releases due to security or other considerations and
- - hope to save some extra storage space with guaranteed end-to-end performance
- by using reduced metadata and transparent file compression, especially
- for those embedded devices with limited memory (ex, smartphone);
+ - hope to minimize extra storage space with guaranteed end-to-end performance
+ by using compact layout, transparent file compression and direct access,
+ especially for those embedded devices with limited memory and high-density
+ hosts with numerous containers.
-Here is the main features of EROFS:
+Here are the main features of EROFS:
- Little endian on-disk design;
- - Currently 4KB block size (nobh) and therefore maximum 16TB address space;
+ - Block-based distribution and file-based distribution over fscache are
+ supported;
+
+ - Support multiple devices to refer to external blobs, which can be used
+ for container images;
- - Metadata & data could be mixed by design;
+ - 32-bit block addresses for each device, therefore 16TiB address space at
+ most with 4KiB block size for now;
- - 2 inode versions for different requirements:
+ - Two inode layouts for different requirements:
- ===================== ============ =====================================
+ ===================== ============ ======================================
compact (v1) extended (v2)
- ===================== ============ =====================================
+ ===================== ============ ======================================
Inode metadata size 32 bytes 64 bytes
- Max file size 4 GB 16 EB (also limited by max. vol size)
+ Max file size 4 GiB 16 EiB (also limited by max. vol size)
Max uids/gids 65536 4294967296
- File change time no yes (64 + 32-bit timestamp)
+ Per-inode timestamp no yes (64 + 32-bit timestamp)
Max hardlinks 65536 4294967296
- Metadata reserved 4 bytes 14 bytes
- ===================== ============ =====================================
+ Metadata reserved 8 bytes 18 bytes
+ ===================== ============ ======================================
+
+ - Support extended attributes as an option;
+
+ - Support a bloom filter that speeds up negative extended attribute lookups;
+
+ - Support POSIX.1e ACLs by using extended attributes;
+
+ - Support transparent data compression as an option:
+ LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; in
+ addition, in-place decompression is also supported to avoid bounce buffers
+ for compressed data and unnecessary page cache thrashing.
+
+ - Support chunk-based data deduplication and rolling-hash compressed data
+ deduplication;
+
+ - Support tailpacking inline compared to byte-addressed unaligned metadata
+ or smaller block size alternatives;
+
+ - Support merging tail-end data into a special inode as fragments.
- - Support extended attributes (xattrs) as an option;
+ - Support large folios for uncompressed files.
- - Support xattr inline and tail-end data inline for all files;
+ - Support direct I/O on uncompressed files to avoid double caching for loop
+ devices;
- - Support POSIX.1e ACLs by using xattrs;
+ - Support FSDAX on uncompressed images for secure containers and ramdisks in
+ order to get rid of unnecessary page cache.
- - Support transparent file compression as an option:
- LZ4 algorithm with 4 KB fixed-sized output compression for high performance.
+ - Support file-based on-demand loading with the Fscache infrastructure.
The following git tree provides the file system user-space tools under
-development (ex, formatting tool mkfs.erofs):
+development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
+compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
+For more information, please also refer to the documentation site:
+
+- https://erofs.docs.kernel.org
+
Bugs and patches are welcome; please kindly help us and send them to the
following linux-erofs mailing list:
@@ -84,8 +121,23 @@ cache_strategy=%s Select a strategy for cached decompression from now on:
It still does in-place I/O decompression
for the rest compressed physical clusters.
========== =============================================
+dax={always,never} Use direct access (no page cache). See
+ Documentation/filesystems/dax.rst.
+dax A legacy option which is an alias for ``dax=always``.
+device=%s Specify a path to an extra device to be used together.
+fsid=%s Specify a filesystem image ID for Fscache back-end.
+domain_id=%s Specify a domain ID in fscache mode so that different images
+ with the same blobs under a given domain ID can share storage.
=================== =========================================================
+Sysfs Entries
+=============
+
+Information about mounted erofs file systems can be found in /sys/fs/erofs.
+Each mounted filesystem will have a directory in /sys/fs/erofs based on its
+device name (e.g., /sys/fs/erofs/sda).
+(see also Documentation/ABI/testing/sysfs-fs-erofs)
+
On-disk details
===============
@@ -113,31 +165,31 @@ may not. All metadatas can be now observed in two different spaces (views):
::
- |-> aligned with 8B
- |-> followed closely
- + meta_blkaddr blocks |-> another slot
- _____________________________________________________________________
- | ... | inode | xattrs | extents | data inline | ... | inode ...
- |________|_______|(optional)|(optional)|__(optional)_|_____|__________
- |-> aligned with the inode slot size
- . .
- . .
- . .
- . .
- . .
- . .
- .____________________________________________________|-> aligned with 4B
- | xattr_ibody_header | shared xattrs | inline xattrs |
- |____________________|_______________|_______________|
- |-> 12 bytes <-|->x * 4 bytes<-| .
- . . .
- . . .
- . . .
- ._______________________________.______________________.
- | id | id | id | id | ... | id | ent | ... | ent| ... |
- |____|____|____|____|______|____|_____|_____|____|_____|
- |-> aligned with 4B
- |-> aligned with 4B
+ |-> aligned with 8B
+ |-> followed closely
+ + meta_blkaddr blocks |-> another slot
+ _____________________________________________________________________
+ | ... | inode | xattrs | extents | data inline | ... | inode ...
+ |________|_______|(optional)|(optional)|__(optional)_|_____|__________
+ |-> aligned with the inode slot size
+ . .
+ . .
+ . .
+ . .
+ . .
+ . .
+ .____________________________________________________|-> aligned with 4B
+ | xattr_ibody_header | shared xattrs | inline xattrs |
+ |____________________|_______________|_______________|
+ |-> 12 bytes <-|->x * 4 bytes<-| .
+ . . .
+ . . .
+ . . .
+ ._______________________________.______________________.
+ | id | id | id | id | ... | id | ent | ... | ent| ... |
+ |____|____|____|____|______|____|_____|_____|____|_____|
+ |-> aligned with 4B
+ |-> aligned with 4B
Inode could be 32 or 64 bytes, which can be distinguished from a common
field which all inode versions have -- i_format::
@@ -151,15 +203,16 @@ may not. All metadatas can be now observed in two different spaces (views):
| |
|__________________| 64 bytes
- Xattrs, extents, data inline are followed by the corresponding inode with
+ Xattrs, extents, data inline are placed after the corresponding inode with
proper alignment, and they could be optional for different data mappings.
- _currently_ total 4 valid data mappings are supported:
+ _currently_ a total of 5 data layouts are supported:
== ====================================================================
0 flat file data without data inline (no extent);
1 fixed-sized output data compression (with non-compacted indexes);
2 flat file data with tail packing data inline (no extent);
- 3 fixed-sized output data compression (with compacted indexes, v5.3+).
+ 3 fixed-sized output data compression (with compacted indexes, v5.3+);
+ 4 chunk-based file (v5.15+).
== ====================================================================
The size of the optional xattrs is indicated by i_xattr_count in inode
@@ -175,13 +228,13 @@ may not. All metadatas can be now observed in two different spaces (views):
Each share xattr can also be directly found by the following formula:
xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
- ::
+::
- |-> aligned by 4 bytes
- + xattr_blkaddr blocks |-> aligned with 4 bytes
- _________________________________________________________________________
- | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
- |________|_____________|_____________|_____|______________|_______________
+ |-> aligned by 4 bytes
+ + xattr_blkaddr blocks |-> aligned with 4 bytes
+ _________________________________________________________________________
+ | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
+ |________|_____________|_____________|_____|______________|_______________
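
That formula transcribes directly; a small illustrative helper (not an actual
kernel function)::

  #include <stdint.h>

  /* byte offset of shared xattr 'xattr_id', per the formula above */
  static uint64_t shared_xattr_offset(uint32_t xattr_blkaddr,
                                      uint32_t block_size, uint32_t xattr_id)
  {
          return (uint64_t)xattr_blkaddr * block_size + 4ULL * xattr_id;
  }
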
Directories
-----------
@@ -193,48 +246,123 @@ algorithm (could refer to the related source code).
::
- ___________________________
- / |
- / ______________|________________
- / / | nameoff1 | nameoffN-1
- ____________.______________._______________v________________v__________
- | dirent | dirent | ... | dirent | filename | filename | ... | filename |
- |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
- \ ^
- \ | * could have
- \ | trailing '\0'
- \________________________| nameoff0
-
- Directory block
+ ___________________________
+ / |
+ / ______________|________________
+ / / | nameoff1 | nameoffN-1
+ ____________.______________._______________v________________v__________
+ | dirent | dirent | ... | dirent | filename | filename | ... | filename |
+ |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
+ \ ^
+ \ | * could have
+ \ | trailing '\0'
+ \________________________| nameoff0
+ Directory block
Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block since there is no need to
introduce another on-disk field at all.
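
For illustration, given the 12-byte on-disk dirent (struct erofs_dirent in
erofs_fs.h), the entry count falls out of nameoff0 as in this sketch::

  /* on-disk dirent size; an in-memory struct may be padded beyond 12 */
  #define EROFS_DIRENT_SIZE 12u

  /* dirents are packed from offset 0, so the first name offset doubles
   * as the number of entries in this block */
  static unsigned int dirent_count(unsigned int nameoff0)
  {
          return nameoff0 / EROFS_DIRENT_SIZE;
  }
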
-Compression
------------
-Currently, EROFS supports 4KB fixed-sized output transparent file compression,
-as illustrated below::
-
- |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
- clusterofs clusterofs clusterofs
- | | | logical data
- _________v_______________________________v_____________________v_______________
- ... | . | | . | | . | ...
- ____|____.________|_____________|________.____|_____________|__.__________|____
- |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
- size size size size size
- . . . .
- . . . .
- . . . .
- _______._____________._____________._____________._____________________
- ... | | | | ... physical data
- _______|_____________|_____________|_____________|_____________________
- |-> cluster <-|-> cluster <-|-> cluster <-|
- size size size
-
-Currently each on-disk physical cluster can contain 4KB (un)compressed data
-at most. For each logical cluster, there is a corresponding on-disk index to
-describe its cluster type, physical cluster address, etc.
-
-See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
+Chunk-based files
+-----------------
+In order to support chunk-based data deduplication, a new inode data layout has
+been supported since Linux v5.15: files are split into equal-sized data
+chunks, with the ``extents`` area of the inode metadata indicating how to get
+the chunk data: this can simply be a 4-byte block address array, or take the
+8-byte chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for
+more details).
+
+Note that chunk-based files are all uncompressed for now.
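
A rough sketch of the 8-byte chunk index form: the field layout mirrors
struct erofs_inode_chunk_index in erofs_fs.h, while the lookup helper is only
an illustrative assumption::

  #include <stdint.h>

  struct erofs_inode_chunk_index {
          uint16_t advise;        /* reserved format hints */
          uint16_t device_id;     /* multi-device index of this chunk */
          uint32_t blkaddr;       /* start block address of this chunk */
  };

  /* with equal-sized chunks of (1 << chunkbits) bytes, the index slot
   * covering a file position is found by a shift */
  static uint64_t chunk_nr(uint64_t pos, unsigned int chunkbits)
  {
          return pos >> chunkbits;
  }
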
+
+Long extended attribute name prefixes
+-------------------------------------
+There are use cases where extended attributes with different values can have
+only a few common prefixes (such as overlayfs xattrs). The predefined prefixes
+work inefficiently in both image size and runtime performance in such cases.
+
+The long xattr name prefixes feature is introduced to address this issue. The
+overall idea is that, apart from the existing predefined prefixes, the xattr
+entry could also refer to user-specified long xattr name prefixes, e.g.
+"trusted.overlay.".
+
+When referring to a long xattr name prefix, the highest bit (bit 7) of
+erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole
+represent the index of the referred long name prefix among all long name
+prefixes. Therefore, only the trailing part of the name apart from the long
+xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if
+the full xattr name matches exactly as its long xattr name prefix.
+
+All long xattr prefixes are stored one by one in the packed inode as long as
+the packed inode is valid, or in the meta inode otherwise. The
+xattr_prefix_count (of the on-disk superblock) indicates the total number of
+long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start
+offset of long name prefixes in the packed/meta inode. Note that, long extended
+attribute name prefixes are disabled if xattr_prefix_count is 0.
+
+Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4),
+where len represents the total size of the data part. The data part is actually
+represented by 'struct erofs_xattr_long_prefix', where base_index represents the
+index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for
+"trusted.overlay." long name prefix, while the infix string keeps the string
+after stripping the short prefix, e.g. "overlay." for the example above.
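
For illustration, decoding one such entry could look like the sketch below;
the struct mirrors erofs_fs.h, the walking helper is an assumption::

  #include <stdint.h>
  #include <stdio.h>

  struct erofs_xattr_long_prefix {
          uint8_t base_index;     /* predefined short prefix index */
          char    infix[];        /* remainder, e.g. "overlay." */
  };

  /* decode the entry at p; return the aligned size to skip to the next */
  static unsigned int decode_prefix(const uint8_t *p)
  {
          unsigned int len = p[0] | (p[1] << 8);      /* __le16 len */
          const struct erofs_xattr_long_prefix *pfx =
                  (const void *)(p + 2);

          printf("base_index=%u infix=%.*s\n", pfx->base_index,
                 (int)(len - 1), pfx->infix);
          return (2 + len + 3) & ~3u;                 /* ALIGN(.., 4) */
  }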
+
+Data compression
+----------------
+EROFS implements fixed-sized output compression which generates fixed-sized
+compressed data blocks from variable-sized input in contrast to other existing
+fixed-sized input solutions. Relatively higher compression ratios can be
+achieved with fixed-sized output compression since nowadays popular data
+compression algorithms are mostly LZ77-based and such a fixed-sized output
+approach benefits from the historical dictionary (aka. the sliding window).
+
+In detail, original (uncompressed) data is turned into several variable-sized
+extents and in the meanwhile, compressed into physical clusters (pclusters).
+In order to record each variable-sized extent, logical clusters (lclusters) are
+introduced as the basic unit of compress indexes to indicate whether a new
+extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now
+fixed in block size, as illustrated below::
+
+ |<- variable-sized extent ->|<- VLE ->|
+ clusterofs clusterofs clusterofs
+ | | |
+ _________v_________________________________v_______________________v________
+ ... | . | | . | | . ...
+ ____|____._________|______________|________.___ _|______________|__.________
+ |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
+ (HEAD) (NONHEAD) (HEAD) (NONHEAD) .
+ . CBLKCNT . .
+ . . .
+ . . .
+ _______._____________________________.______________._________________
+ ... | | | | ...
+ _______|______________|______________|______________|_________________
+ |-> big pcluster <-|-> pcluster <-|
+
+A physical cluster can be seen as a container of physical compressed blocks
+which contains compressed data. Previously, only lcluster-sized (4KB) pclusters
+were supported. After the big pcluster feature was introduced (available since
+Linux v5.13), a pcluster can be a multiple of the lcluster size.
+
+For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
+starts and blkaddr is used to seek the compressed data. For each NONHEAD
+lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
+distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
+also a HEAD lcluster except that its data is uncompressed. See the comments
+around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
+
+If big pcluster is enabled, pcluster size in lclusters needs to be recorded as
+well. The delta0 of the first NONHEAD lcluster stores the compressed block
+count together with a special flag, making it a new NONHEAD lcluster type
+called CBLKCNT. It's easy to see that its delta0 is constantly 1, as
+illustrated below::
+
+ __________________________________________________________
+ | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD |
+ |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
+ |<----- a big pcluster (with CBLKCNT) ------>|<-- -->|
+ a lcluster-sized pcluster (without CBLKCNT) ^
+
+If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
+but it's easy to know the size of such a pcluster is 1 lcluster as well.
+
+Since Linux v6.1, each pcluster can be used for multiple variable-sized
+extents; therefore, it can be used for compressed data deduplication.
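
As a toy model of the index walk described above (simplified types and a
flattened index array; not the kernel's z_erofs code)::

  #include <stdint.h>

  enum lcluster_type { HEAD, NONHEAD, PLAIN };

  struct lcluster {                  /* simplified compress index slot */
          enum lcluster_type type;
          uint16_t clusterofs;       /* HEAD: extent start offset in block */
          uint16_t delta0;           /* NONHEAD: distance back to the HEAD */
  };

  /* find the HEAD lcluster starting the extent that covers lcluster n */
  static uint64_t find_head(const struct lcluster *idx, uint64_t n)
  {
          if (idx[n].type == NONHEAD)
                  return n - idx[n].delta0;
          return n;                  /* HEAD and PLAIN start a new extent */
  }
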
diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst
index d83dbbb162e2..92aae683e16a 100644
--- a/Documentation/filesystems/ext2.rst
+++ b/Documentation/filesystems/ext2.rst
@@ -1,6 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+==============================
The Second Extended Filesystem
==============================
@@ -24,7 +25,7 @@ check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
dax Use direct access (no page cache). See
- Documentation/filesystems/dax.txt.
+ Documentation/filesystems/dax.rst.
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
@@ -58,8 +59,6 @@ acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
noacl Don't support POSIX ACLs.
-nobh Do not attach buffer_heads to file pagecache.
-
quota, usrquota Enable user disk quota support
(requires CONFIG_QUOTA).
diff --git a/Documentation/filesystems/ext4/about.rst b/Documentation/filesystems/ext4/about.rst
index 0aadba052264..cc76b577d2f4 100644
--- a/Documentation/filesystems/ext4/about.rst
+++ b/Documentation/filesystems/ext4/about.rst
@@ -39,6 +39,6 @@ entry.
Other References
----------------
-Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
+Also see https://www.nongnu.org/ext2-doc/ for quite a collection of
information about ext2/3. Here's another old reference:
http://wiki.osdev.org/Ext2
diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/filesystems/ext4/attributes.rst
index 54386a010a8d..87814696a65b 100644
--- a/Documentation/filesystems/ext4/attributes.rst
+++ b/Documentation/filesystems/ext4/attributes.rst
@@ -13,8 +13,8 @@ disappeared as of Linux 3.0.
There are two places where extended attributes can be found. The first
place is between the end of each inode entry and the beginning of the
-next inode entry. For example, if inode.i\_extra\_isize = 28 and
-sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes
+next inode entry. For example, if inode.i_extra_isize = 28 and
+sb.inode_size = 256, then there are 256 - (128 + 28) = 100 bytes
available for in-inode extended attribute storage. The second place
where extended attributes can be found is in the block pointed to by
``inode.i_file_acl``. As of Linux 3.11, it is not possible for this
@@ -38,8 +38,8 @@ Extended attributes, when stored after the inode, have a header
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_magic
+ - __le32
+ - h_magic
- Magic number for identification, 0xEA020000. This value is set by the
Linux driver, though e2fsprogs doesn't seem to check it(?)
@@ -55,28 +55,28 @@ The beginning of an extended attribute block is in
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_magic
+ - __le32
+ - h_magic
- Magic number for identification, 0xEA020000.
* - 0x4
- - \_\_le32
- - h\_refcount
+ - __le32
+ - h_refcount
- Reference count.
* - 0x8
- - \_\_le32
- - h\_blocks
+ - __le32
+ - h_blocks
- Number of disk blocks used.
* - 0xC
- - \_\_le32
- - h\_hash
+ - __le32
+ - h_hash
- Hash value of all attributes.
* - 0x10
- - \_\_le32
- - h\_checksum
+ - __le32
+ - h_checksum
- Checksum of the extended attribute block.
* - 0x14
- - \_\_u32
- - h\_reserved[2]
+ - __u32
+ - h_reserved[3]
- Zero.
The checksum is calculated against the FS UUID, the 64-bit block number
@@ -100,46 +100,46 @@ Attributes stored inside an inode do not need be stored in sorted order.
- Name
- Description
* - 0x0
- - \_\_u8
- - e\_name\_len
+ - __u8
+ - e_name_len
- Length of name.
* - 0x1
- - \_\_u8
- - e\_name\_index
+ - __u8
+ - e_name_index
- Attribute name index. There is a discussion of this below.
* - 0x2
- - \_\_le16
- - e\_value\_offs
+ - __le16
+ - e_value_offs
- Location of this attribute's value on the disk block where it is stored.
Multiple attributes can share the same value. For an inode attribute
this value is relative to the start of the first entry; for a block this
value is relative to the start of the block (i.e. the header).
* - 0x4
- - \_\_le32
- - e\_value\_inum
+ - __le32
+ - e_value_inum
- The inode where the value is stored. Zero indicates the value is in the
same block as this entry. This field is only used if the
- INCOMPAT\_EA\_INODE feature is enabled.
+ INCOMPAT_EA_INODE feature is enabled.
* - 0x8
- - \_\_le32
- - e\_value\_size
+ - __le32
+ - e_value_size
- Length of attribute value.
* - 0xC
- - \_\_le32
- - e\_hash
+ - __le32
+ - e_hash
- Hash value of attribute name and attribute value. The kernel doesn't
update the hash for in-inode attributes, so for that case this value
must be zero, because e2fsck validates any non-zero hash regardless of
where the xattr lives.
* - 0x10
- char
- - e\_name[e\_name\_len]
+ - e_name[e_name_len]
- Attribute name. Does not include trailing NULL.
Attribute values can follow the end of the entry table. There appears to
be a requirement that they be aligned to 4-byte boundaries. The values
are stored starting at the end of the block and grow towards the
-xattr\_header/xattr\_entry table. When the two collide, the overflow is
+xattr_header/xattr_entry table. When the two collide, the overflow is
put into a separate disk block. If the disk block fills up, the
filesystem returns -ENOSPC.
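
For reference, the entry layout in the table transcribes to the following
sketch, with fixed-width host types standing in for the on-disk little-endian
__le fields (compare fs/ext4/xattr.h)::

  #include <stdint.h>

  struct ext4_xattr_entry {
          uint8_t  e_name_len;      /* length of name */
          uint8_t  e_name_index;    /* name index, see the map below */
          uint16_t e_value_offs;    /* offset of value in block or ibody */
          uint32_t e_value_inum;    /* EA inode number, or 0 */
          uint32_t e_value_size;    /* length of value */
          uint32_t e_hash;          /* hash of name+value; 0 for in-inode */
          char     e_name[];        /* name, no trailing NULL */
  };
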
@@ -167,15 +167,15 @@ the key name. Here is a map of name index values to key prefixes:
* - 1
- “user.”
* - 2
- - “system.posix\_acl\_access”
+ - “system.posix_acl_access”
* - 3
- - “system.posix\_acl\_default”
+ - “system.posix_acl_default”
* - 4
- “trusted.”
* - 6
- “security.”
* - 7
- - “system.” (inline\_data only?)
+ - “system.” (inline_data only?)
* - 8
- “system.richacl” (SuSE kernels only?)
diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst
index 72075aa608e4..976a180b209c 100644
--- a/Documentation/filesystems/ext4/bigalloc.rst
+++ b/Documentation/filesystems/ext4/bigalloc.rst
@@ -23,7 +23,7 @@ means that a block group addresses 32 gigabytes instead of 128 megabytes,
also shrinking the amount of file system overhead for metadata.
The administrator can set a block cluster size at mkfs time (which is
-stored in the s\_log\_cluster\_size field in the superblock); from then
+stored in the s_log_cluster_size field in the superblock); from then
on, the block bitmaps track clusters, not individual blocks. This means
that block groups can be several gigabytes in size (instead of just
128MiB); however, the minimum allocation unit becomes a cluster, not a
diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst
index c7546dbc197a..91c45d86e9bb 100644
--- a/Documentation/filesystems/ext4/bitmaps.rst
+++ b/Documentation/filesystems/ext4/bitmaps.rst
@@ -9,15 +9,15 @@ group.
The inode bitmap records which entries in the inode table are in use.
As with most bitmaps, one bit represents the usage status of one data
-block or inode table entry. This implies a block group size of 8 \*
-number\_of\_bytes\_in\_a\_logical\_block.
+block or inode table entry. This implies a block group size of 8 *
+number_of_bytes_in_a_logical_block.
NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts
of the kernel and e2fsprogs code pretend that the block bitmap contains
zeros (i.e. all blocks in the group are free). However, it is not
necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
the bitmaps and group descriptor live inside the group. Unfortunately,
-ext2fs\_test\_block\_bitmap2() will return '0' for those locations,
+ext2fs_test_block_bitmap2() will return '0' for those locations,
which produces confusing debugfs output.
Inode Table
diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index 3da156633339..ed5a5cac6d40 100644
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
@@ -56,39 +56,39 @@ established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
-groups (flex\_bg). Leftover space is used for file data blocks, indirect
+groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.
Flexible Block Groups
---------------------
Starting in ext4, there is a new feature called flexible block groups
-(flex\_bg). In a flex\_bg, several block groups are tied together as one
+(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
-first block group of the flex\_bg are expanded to include the bitmaps
-and inode tables of all other block groups in the flex\_bg. For example,
-if the flex\_bg size is 4, then group 0 will contain (in order) the
+first block group of the flex_bg are expanded to include the bitmaps
+and inode tables of all other block groups in the flex_bg. For example,
+if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be continuous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
-if flex\_bg is enabled. The number of block groups that make up a
-flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
+if flex_bg is enabled. The number of block groups that make up a
+flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
Meta Block Groups
-----------------
-Without the option META\_BG, for safety concerns, all block group
+Without the option META_BG, for safety concerns, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
-filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
+filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
The solution to this problem is to use the metablock group feature
-(META\_BG), which is already in ext3 for all 2.6 releases. With the
-META\_BG feature, ext4 filesystems are partitioned into many metablock
+(META_BG), which is already in ext3 for all 2.6 releases. With the
+META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
@@ -105,12 +105,12 @@ descriptors. Instead, the superblock and a single block group descriptor
block is placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
-block. Since the size of the block group descriptor structure is 32
-bytes, a meta-block group contains 32 block groups for filesystems with
-a 1KB block size, and 128 block groups for filesystems with a 4KB
+block. Since the size of the block group descriptor structure is 64
+bytes, a meta-block group contains 16 block groups for filesystems with
+a 1KB block size, and 64 block groups for filesystems with a 4KB
blocksize. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
-the field s\_first\_meta\_bg in the superblock will indicate the first
+the field s_first_meta_bg in the superblock will indicate the first
block group using this new layout.
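
The group counts above are just the block size divided by the descriptor
size; a quick illustrative check::

  #include <stdio.h>

  int main(void)
  {
          const unsigned int desc_size = 64;     /* 64-byte descriptors */
          const unsigned int blocksizes[] = { 1024, 4096 };

          for (int i = 0; i < 2; i++)            /* prints 16 and 64 */
                  printf("%u-byte blocks: %u groups per meta_bg\n",
                         blocksizes[i], blocksizes[i] / desc_size);
          return 0;
  }
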
Please see an important note about ``BLOCK_UNINIT`` in the section about
@@ -121,15 +121,15 @@ Lazy Block Group Initialization
A new feature for ext4 are three block group descriptor flags that
enable mkfs to skip initializing other parts of the block group
-metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
+metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
-only fixed-location block group metadata. The INODE\_ZEROED flag means
+only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
By not writing zeroes to the bitmaps and inode table, mkfs time is
-reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
-but the dumpe2fs output prints this as “uninit\_bg”. They are the same
+reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
+but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.
diff --git a/Documentation/filesystems/ext4/blockmap.rst b/Documentation/filesystems/ext4/blockmap.rst
index 30e25750d88a..cc596541ce79 100644
--- a/Documentation/filesystems/ext4/blockmap.rst
+++ b/Documentation/filesystems/ext4/blockmap.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| i.i\_block Offset | Where It Points |
+| i.i_block Offset | Where It Points |
+=====================+==============================================================================================================================================================================================================================+
| 0 to 11 | Direct map to file blocks 0 to 11. |
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
index bd722ecd92d6..b0f80ea87c90 100644
--- a/Documentation/filesystems/ext4/blocks.rst
+++ b/Documentation/filesystems/ext4/blocks.rst
@@ -39,7 +39,7 @@ For 32-bit filesystems, limits are as follows:
- 4TiB
- 8TiB
- 16TiB
- - 256PiB
+ - 256TiB
* - Blocks Per Block Group
- 8,192
- 16,384
diff --git a/Documentation/filesystems/ext4/checksums.rst b/Documentation/filesystems/ext4/checksums.rst
index 5519e253810d..e232749daf5f 100644
--- a/Documentation/filesystems/ext4/checksums.rst
+++ b/Documentation/filesystems/ext4/checksums.rst
@@ -4,7 +4,7 @@ Checksums
---------
Starting in early 2012, metadata checksums were added to all major ext4
-and jbd2 data structures. The associated feature flag is metadata\_csum.
+and jbd2 data structures. The associated feature flag is metadata_csum.
The desired checksum algorithm is indicated in the superblock, though as
of October 2012 the only supported algorithm is crc32c. Some data
structures did not have space to fit a full 32-bit checksum, so only the
@@ -20,7 +20,7 @@ encounters directory blocks that lack sufficient empty space to add a
checksum, it will request that you run ``e2fsck -D`` to have the
directories rebuilt with checksums. This has the added benefit of
removing slack space from the directory files and rebalancing the htree
-indexes. If you \_ignore\_ this step, your directories will not be
+indexes. If you _ignore_ this step, your directories will not be
protected by a checksum!
The following table describes the data elements that go into each type
@@ -35,39 +35,39 @@ of checksum. The checksum function is whatever the superblock describes
- Length
- Ingredients
* - Superblock
- - \_\_le32
+ - __le32
- The entire superblock up to the checksum field. The UUID lives inside
the superblock.
* - MMP
- - \_\_le32
+ - __le32
- UUID + the entire MMP block up to the checksum field.
* - Extended Attributes
- - \_\_le32
+ - __le32
- UUID + the entire extended attribute block. The checksum field is set to
zero.
* - Directory Entries
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the directory block up to the
fake entry enclosing the checksum field.
* - HTREE Nodes
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + all valid extents + HTREE tail.
The checksum field is set to zero.
* - Extents
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the entire extent block up to
the checksum field.
* - Bitmaps
- - \_\_le32 or \_\_le16
+ - __le32 or __le16
- UUID + the entire bitmap. Checksums are stored in the group descriptor,
and truncated if the group descriptor size is 32 bytes (i.e., without the
64bit feature)
* - Inodes
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the entire inode. The checksum
field is set to zero. Each inode has its own checksum.
* - Group Descriptors
- - \_\_le16
- - If metadata\_csum, then UUID + group number + the entire descriptor;
- else if gdt\_csum, then crc16(UUID + group number + the entire
+ - __le16
+ - If metadata_csum, then UUID + group number + the entire descriptor;
+ else if gdt_csum, then crc16(UUID + group number + the entire
descriptor). In all cases, only the lower 16 bits are stored.
diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst
index 073940cc64ed..6eece8e31df8 100644
--- a/Documentation/filesystems/ext4/directory.rst
+++ b/Documentation/filesystems/ext4/directory.rst
@@ -42,24 +42,24 @@ is at most 263 bytes long, though on disk you'll need to reference
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- inode
- Number of the inode that this directory entry points to.
* - 0x4
- - \_\_le16
- - rec\_len
+ - __le16
+ - rec_len
- Length of this directory entry. Must be a multiple of 4.
* - 0x6
- - \_\_le16
- - name\_len
+ - __le16
+ - name_len
- Length of the file name.
* - 0x8
- char
- - name[EXT4\_NAME\_LEN]
+ - name[EXT4_NAME_LEN]
- File name.
Since file names cannot be longer than 255 bytes, the new directory
-entry format shortens the name\_len field and uses the space for a file
+entry format shortens the name_len field and uses the space for a file
type flag, probably to avoid having to load every inode during directory
tree traversal. This format is ``ext4_dir_entry_2``, which is at most
263 bytes long, though on disk you'll need to reference
@@ -74,24 +74,24 @@ tree traversal. This format is ``ext4_dir_entry_2``, which is at most
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- inode
- Number of the inode that this directory entry points to.
* - 0x4
- - \_\_le16
- - rec\_len
+ - __le16
+ - rec_len
- Length of this directory entry.
* - 0x6
- - \_\_u8
- - name\_len
+ - __u8
+ - name_len
- Length of the file name.
* - 0x7
- - \_\_u8
- - file\_type
+ - __u8
+ - file_type
- File type code, see ftype_ table below.
* - 0x8
- char
- - name[EXT4\_NAME\_LEN]
+ - name[EXT4_NAME_LEN]
- File name.
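
As a compact restatement of the table, the entry transcribes to roughly the
following sketch (fixed-width host types stand in for the on-disk
little-endian fields; compare fs/ext4/ext4.h)::

  #include <stdint.h>

  #define EXT4_NAME_LEN 255

  struct ext4_dir_entry_2 {
          uint32_t inode;               /* referenced inode number */
          uint16_t rec_len;             /* length of this entry */
          uint8_t  name_len;            /* length of the file name */
          uint8_t  file_type;           /* see the ftype table below */
          char     name[EXT4_NAME_LEN]; /* name, not NULL-terminated */
  };
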
.. _ftype:
@@ -121,10 +121,35 @@ The directory file type is one of the following values:
* - 0x7
- Symbolic link.
+To support directories that are both encrypted and casefolded, we must also
+include hash information in the directory entry. We append
+``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries
+for dot and dotdot, which are kept the same. The structure follows immediately
+after ``name`` and is included in the size listed by ``rec_len``. If a
+directory entry uses this extension, it may be up to 271 bytes.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Size
+ - Name
+ - Description
+ * - 0x0
+ - __le32
+ - hash
+ - The hash of the directory name
+ * - 0x4
+ - __le32
+ - minor_hash
+ - The minor hash of the directory name
+
+
In order to add checksums to these classic directory blocks, a phony
``struct ext4_dir_entry`` is placed at the end of each leaf block to
hold the checksum. The directory entry is 12 bytes long. The inode
-number and name\_len fields are set to zero to fool old software into
+number and name_len fields are set to zero to fool old software into
ignoring an apparently empty directory entry, and the checksum is stored
in the place where the name normally goes. The structure is
``struct ext4_dir_entry_tail``:
@@ -138,24 +163,24 @@ in the place where the name normally goes. The structure is
- Name
- Description
* - 0x0
- - \_\_le32
- - det\_reserved\_zero1
+ - __le32
+ - det_reserved_zero1
- Inode number, which must be zero.
* - 0x4
- - \_\_le16
- - det\_rec\_len
+ - __le16
+ - det_rec_len
- Length of this directory entry, which must be 12.
* - 0x6
- - \_\_u8
- - det\_reserved\_zero2
+ - __u8
+ - det_reserved_zero2
- Length of the file name, which must be zero.
* - 0x7
- - \_\_u8
- - det\_reserved\_ft
+ - __u8
+ - det_reserved_ft
- File type, which must be 0xDE.
* - 0x8
- - \_\_le32
- - det\_checksum
+ - __le32
+ - det_checksum
- Directory leaf block checksum.
The leaf directory block checksum is calculated against the FS UUID, the
@@ -169,7 +194,7 @@ Hash Tree Directories
A linear array of directory entries isn't great for performance, so a
new feature was added to ext3 to provide a faster (but peculiar)
balanced tree keyed off a hash of the directory entry name. If the
-EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
+EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a
hashed btree (htree) to organize and find directory entries. For
backwards read-only compatibility with ext2, this tree is actually
hidden inside the directory file, masquerading as “empty” directory data
@@ -181,14 +206,14 @@ rest of the directory block is empty so that it moves on.
The root of the tree always lives in the first data block of the
directory. By ext2 custom, the '.' and '..' entries must appear at the
beginning of this first block, so they are put here as two
-``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
+``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
the root node contains metadata about the tree and finally a hash->block
map to find nodes that are lower in the htree. If
``dx_root.info.indirect_levels`` is non-zero then the htree has two
levels; the data block pointed to by the root node's map is an interior
node, which is indexed by a minor hash. Interior nodes in this tree
contain a zeroed-out ``struct ext4_dir_entry_2`` followed by a
-minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
+minor_hash->block map to find leaf nodes. Leaf nodes contain a linear
array of all ``struct ext4_dir_entry_2``; all of these entries
(presumably) hash to the same value. If there is an overflow, the
entries simply overflow into the next leaf node, and the
@@ -220,83 +245,83 @@ of a data block:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- dot.inode
- inode number of this directory.
* - 0x4
- - \_\_le16
- - dot.rec\_len
+ - __le16
+ - dot.rec_len
- Length of this record, 12.
* - 0x6
- u8
- - dot.name\_len
+ - dot.name_len
- Length of the name, 1.
* - 0x7
- u8
- - dot.file\_type
+ - dot.file_type
- File type of this entry, 0x2 (directory) (if the feature flag is set).
* - 0x8
- char
- dot.name[4]
- - “.\\0\\0\\0”
+ - “.\0\0\0”
* - 0xC
- - \_\_le32
+ - __le32
- dotdot.inode
- inode number of parent directory.
* - 0x10
- - \_\_le16
- - dotdot.rec\_len
- - block\_size - 12. The record length is long enough to cover all htree
+ - __le16
+ - dotdot.rec_len
+ - block_size - 12. The record length is long enough to cover all htree
data.
* - 0x12
- u8
- - dotdot.name\_len
+ - dotdot.name_len
- Length of the name, 2.
* - 0x13
- u8
- - dotdot.file\_type
+ - dotdot.file_type
- File type of this entry, 0x2 (directory) (if the feature flag is set).
* - 0x14
- char
- - dotdot\_name[4]
- - “..\\0\\0”
+ - dotdot_name[4]
+ - “..\0\0”
* - 0x18
- - \_\_le32
- - struct dx\_root\_info.reserved\_zero
+ - __le32
+ - struct dx_root_info.reserved_zero
- Zero.
* - 0x1C
- u8
- - struct dx\_root\_info.hash\_version
+ - struct dx_root_info.hash_version
- Hash type, see dirhash_ table below.
* - 0x1D
- u8
- - struct dx\_root\_info.info\_length
+ - struct dx_root_info.info_length
- Length of the tree information, 0x8.
* - 0x1E
- u8
- - struct dx\_root\_info.indirect\_levels
- - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
+ - struct dx_root_info.indirect_levels
+ - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR
feature is set; cannot be larger than 2 otherwise.
* - 0x1F
- u8
- - struct dx\_root\_info.unused\_flags
+ - struct dx_root_info.unused_flags
-
* - 0x20
- - \_\_le16
+ - __le16
- limit
- - Maximum number of dx\_entries that can follow this header, plus 1 for
+ - Maximum number of dx_entries that can follow this header, plus 1 for
the header itself.
* - 0x22
- - \_\_le16
+ - __le16
- count
- - Actual number of dx\_entries that follow this header, plus 1 for the
+ - Actual number of dx_entries that follow this header, plus 1 for the
header itself.
* - 0x24
- - \_\_le32
+ - __le32
- block
- The block number (within the directory file) that goes with hash=0.
* - 0x28
- - struct dx\_entry
+ - struct dx_entry
- entries[0]
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
@@ -322,6 +347,8 @@ The directory hash is one of the following values:
- Half MD4, unsigned.
* - 0x5
- Tea, unsigned.
+ * - 0x6
+ - Siphash.
Interior nodes of an htree are recorded as ``struct dx_node``, which is
also the full length of a data block:
@@ -335,38 +362,38 @@ also the full length of a data block:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- fake.inode
- Zero, to make it look like this entry is not in use.
* - 0x4
- - \_\_le16
- - fake.rec\_len
- - The size of the block, in order to hide all of the dx\_node data.
+ - __le16
+ - fake.rec_len
+ - The size of the block, in order to hide all of the dx_node data.
* - 0x6
- u8
- - name\_len
+ - name_len
- Zero. There is no name for this “unused” directory entry.
* - 0x7
- u8
- - file\_type
+ - file_type
- Zero. There is no file type for this “unused” directory entry.
* - 0x8
- - \_\_le16
+ - __le16
- limit
- - Maximum number of dx\_entries that can follow this header, plus 1 for
+ - Maximum number of dx_entries that can follow this header, plus 1 for
the header itself.
* - 0xA
- - \_\_le16
+ - __le16
- count
- - Actual number of dx\_entries that follow this header, plus 1 for the
+ - Actual number of dx_entries that follow this header, plus 1 for the
header itself.
* - 0xE
- - \_\_le32
+ - __le32
- block
- The block number (within the directory file) that goes with the lowest
hash value of this block. This value is stored in the parent block.
* - 0x12
- - struct dx\_entry
+ - struct dx_entry
- entries[0]
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
@@ -383,11 +410,11 @@ long:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- hash
- Hash code.
* - 0x4
- - \_\_le32
+ - __le32
- block
- Block number (within the directory file, not filesystem blocks) of the
next node in the htree.
@@ -396,13 +423,13 @@ long:
author.)
If metadata checksums are enabled, the last 8 bytes of the directory
-block (precisely the length of one dx\_entry) are used to store a
+block (precisely the length of one dx_entry) are used to store a
``struct dx_tail``, which contains the checksum. The ``limit`` and
-``count`` entries in the dx\_root/dx\_node structures are adjusted as
-necessary to fit the dx\_tail into the block. If there is no space for
-the dx\_tail, the user is notified to run e2fsck -D to rebuild the
+``count`` entries in the dx_root/dx_node structures are adjusted as
+necessary to fit the dx_tail into the block. If there is no space for
+the dx_tail, the user is notified to run e2fsck -D to rebuild the
directory index (which will ensure that there's space for the checksum).
-The dx\_tail structure is 8 bytes long and looks like this:
+The dx_tail structure is 8 bytes long and looks like this:
.. list-table::
:widths: 8 8 24 40
@@ -414,13 +441,13 @@ The dx\_tail structure is 8 bytes long and looks like this:
- Description
* - 0x0
- u32
- - dt\_reserved
+ - dt_reserved
- Zero.
* - 0x4
- - \_\_le32
- - dt\_checksum
+ - __le32
+ - dt_checksum
- Checksum of the htree directory block.
The checksum is calculated against the FS UUID, the htree index header
-(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
-use, and the tail block (dx\_tail).
+(dx_root or dx_node), all of the htree indices (dx_entry) that are in
+use, and the tail block (dx_tail).
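
As an illustration of how these hash maps are used, a lookup over a dx_entry
array reduces to a binary search like the sketch below (a simplified model,
not the kernel's dx_probe())::

  #include <stdint.h>

  struct dx_entry {
          uint32_t hash;     /* lowest hash covered by 'block' */
          uint32_t block;    /* directory-file block of the child node */
  };

  /* pick the child block for 'hash'; entries[0] pairs with hash 0 and
   * 'count' is the number of valid entries (assumed >= 1) */
  static uint32_t dx_find_block(const struct dx_entry *entries,
                                unsigned int count, uint32_t hash)
  {
          unsigned int lo = 1, hi = count - 1;

          while (lo <= hi) {         /* last entry with entry.hash <= hash */
                  unsigned int mid = (lo + hi) / 2;

                  if (entries[mid].hash > hash)
                          hi = mid - 1;
                  else
                          lo = mid + 1;
          }
          return entries[lo - 1].block;
  }
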
diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst
index ecc0d01a0a72..7a2ef26b064a 100644
--- a/Documentation/filesystems/ext4/eainode.rst
+++ b/Documentation/filesystems/ext4/eainode.rst
@@ -5,14 +5,14 @@ Large Extended Attribute Values
To enable ext4 to store extended attribute values that do not fit in the
inode or in the single extended attribute block attached to an inode,
-the EA\_INODE feature allows us to store the value in the data blocks of
+the EA_INODE feature allows us to store the value in the data blocks of
a regular file inode. This “EA inode” is linked only from the extended
attribute name index and must not appear in a directory entry. The
-inode's i\_atime field is used to store a checksum of the xattr value;
-and i\_ctime/i\_version store a 64-bit reference count, which enables
+inode's i_atime field is used to store a checksum of the xattr value;
+and i_ctime/i_version store a 64-bit reference count, which enables
sharing of large xattr values between multiple owning inodes. For
backward compatibility with older versions of this feature, the
-i\_mtime/i\_generation *may* store a back-reference to the inode number
-and i\_generation of the **one** owning inode (in cases where the EA
+i_mtime/i_generation *may* store a back-reference to the inode number
+and i_generation of the **one** owning inode (in cases where the EA
inode is not referenced by multiple inodes) to verify that the EA inode
is the correct one being accessed.
diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/filesystems/ext4/globals.rst
index 368bf7662b96..b17418974fd3 100644
--- a/Documentation/filesystems/ext4/globals.rst
+++ b/Documentation/filesystems/ext4/globals.rst
@@ -11,3 +11,4 @@ have static metadata at fixed locations.
.. include:: bitmaps.rst
.. include:: mmp.rst
.. include:: journal.rst
+.. include:: orphan.rst
diff --git a/Documentation/filesystems/ext4/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst
index 7ba6114e7f5c..392ec44f8fb0 100644
--- a/Documentation/filesystems/ext4/group_descr.rst
+++ b/Documentation/filesystems/ext4/group_descr.rst
@@ -7,34 +7,34 @@ Each block group on the filesystem has one of these descriptors
associated with it. As noted in the Layout section above, the group
descriptors (if present) are the second item in the block group. The
standard configuration is for each block group to contain a full copy of
-the block group descriptor table unless the sparse\_super feature flag
+the block group descriptor table unless the sparse_super feature flag
is set.
Notice how the group descriptor records the location of both bitmaps and
the inode table (i.e. they can float). This means that within a block
group, the only data structures with fixed locations are the superblock
-and the group descriptor table. The flex\_bg mechanism uses this
+and the group descriptor table. The flex_bg mechanism uses this
property to group several block groups into a flex group and lay out all
of the groups' bitmaps and inode tables into one long run in the first
group of the flex group.
-If the meta\_bg feature flag is set, then several block groups are
-grouped together into a meta group. Note that in the meta\_bg case,
+If the meta_bg feature flag is set, then several block groups are
+grouped together into a meta group. Note that in the meta_bg case,
however, the first and last two block groups within the larger meta
group contain only group descriptors for the groups inside the meta
group.
-flex\_bg and meta\_bg do not appear to be mutually exclusive features.
+flex_bg and meta_bg do not appear to be mutually exclusive features.
In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the
block group descriptor was only 32 bytes long and therefore ends at
-bg\_checksum. On an ext4 filesystem with the 64bit feature enabled, the
+bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the
block group descriptor expands to at least the 64 bytes described below;
the size is stored in the superblock.
-If gdt\_csum is set and metadata\_csum is not set, the block group
+If gdt_csum is set and metadata_csum is not set, the block group
checksum is the crc16 of the FS UUID, the group number, and the group
-descriptor structure. If metadata\_csum is set, then the block group
+descriptor structure. If metadata_csum is set, then the block group
checksum is the lower 16 bits of the checksum of the FS UUID, the group
number, and the group descriptor structure. Both block and inode bitmap
checksums are calculated against the FS UUID, the group number, and the
@@ -51,59 +51,59 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
- Name
- Description
* - 0x0
- - \_\_le32
- - bg\_block\_bitmap\_lo
+ - __le32
+ - bg_block_bitmap_lo
- Lower 32-bits of location of block bitmap.
* - 0x4
- - \_\_le32
- - bg\_inode\_bitmap\_lo
+ - __le32
+ - bg_inode_bitmap_lo
- Lower 32-bits of location of inode bitmap.
* - 0x8
- - \_\_le32
- - bg\_inode\_table\_lo
+ - __le32
+ - bg_inode_table_lo
- Lower 32-bits of location of inode table.
* - 0xC
- - \_\_le16
- - bg\_free\_blocks\_count\_lo
+ - __le16
+ - bg_free_blocks_count_lo
- Lower 16-bits of free block count.
* - 0xE
- - \_\_le16
- - bg\_free\_inodes\_count\_lo
+ - __le16
+ - bg_free_inodes_count_lo
- Lower 16-bits of free inode count.
* - 0x10
- - \_\_le16
- - bg\_used\_dirs\_count\_lo
+ - __le16
+ - bg_used_dirs_count_lo
- Lower 16-bits of directory count.
* - 0x12
- - \_\_le16
- - bg\_flags
+ - __le16
+ - bg_flags
- Block group flags. See the bgflags_ table below.
* - 0x14
- - \_\_le32
- - bg\_exclude\_bitmap\_lo
+ - __le32
+ - bg_exclude_bitmap_lo
- Lower 32-bits of location of snapshot exclusion bitmap.
* - 0x18
- - \_\_le16
- - bg\_block\_bitmap\_csum\_lo
+ - __le16
+ - bg_block_bitmap_csum_lo
- Lower 16-bits of the block bitmap checksum.
* - 0x1A
- - \_\_le16
- - bg\_inode\_bitmap\_csum\_lo
+ - __le16
+ - bg_inode_bitmap_csum_lo
- Lower 16-bits of the inode bitmap checksum.
* - 0x1C
- - \_\_le16
- - bg\_itable\_unused\_lo
+ - __le16
+ - bg_itable_unused_lo
- Lower 16-bits of unused inode count. If set, we needn't scan past the
- ``(sb.s_inodes_per_group - gdt.bg_itable_unused)``\ th entry in the
+ ``(sb.s_inodes_per_group - gdt.bg_itable_unused)``\ th entry in the
inode table for this group.
* - 0x1E
- - \_\_le16
- - bg\_checksum
- - Group descriptor checksum; crc16(sb\_uuid+group\_num+bg\_desc) if the
- RO\_COMPAT\_GDT\_CSUM feature is set, or
- crc32c(sb\_uuid+group\_num+bg\_desc) & 0xFFFF if the
- RO\_COMPAT\_METADATA\_CSUM feature is set. The bg\_checksum
- field in bg\_desc is skipped when calculating crc16 checksum,
+ - __le16
+ - bg_checksum
+ - Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the
+ RO_COMPAT_GDT_CSUM feature is set, or
+ crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the
+ RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum
+ field in bg_desc is skipped when calculating crc16 checksum,
and set to zero if crc32c checksum is used.
* -
-
@@ -111,48 +111,48 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
- These fields only exist if the 64bit feature is enabled and s_desc_size
> 32.
* - 0x20
- - \_\_le32
- - bg\_block\_bitmap\_hi
+ - __le32
+ - bg_block_bitmap_hi
- Upper 32-bits of location of block bitmap.
* - 0x24
- - \_\_le32
- - bg\_inode\_bitmap\_hi
+ - __le32
+ - bg_inode_bitmap_hi
- Upper 32-bits of location of inodes bitmap.
* - 0x28
- - \_\_le32
- - bg\_inode\_table\_hi
+ - __le32
+ - bg_inode_table_hi
- Upper 32-bits of location of inodes table.
* - 0x2C
- - \_\_le16
- - bg\_free\_blocks\_count\_hi
+ - __le16
+ - bg_free_blocks_count_hi
- Upper 16-bits of free block count.
* - 0x2E
- - \_\_le16
- - bg\_free\_inodes\_count\_hi
+ - __le16
+ - bg_free_inodes_count_hi
- Upper 16-bits of free inode count.
* - 0x30
- - \_\_le16
- - bg\_used\_dirs\_count\_hi
+ - __le16
+ - bg_used_dirs_count_hi
- Upper 16-bits of directory count.
* - 0x32
- - \_\_le16
- - bg\_itable\_unused\_hi
+ - __le16
+ - bg_itable_unused_hi
- Upper 16-bits of unused inode count.
* - 0x34
- - \_\_le32
- - bg\_exclude\_bitmap\_hi
+ - __le32
+ - bg_exclude_bitmap_hi
- Upper 32-bits of location of snapshot exclusion bitmap.
* - 0x38
- - \_\_le16
- - bg\_block\_bitmap\_csum\_hi
+ - __le16
+ - bg_block_bitmap_csum_hi
- Upper 16-bits of the block bitmap checksum.
* - 0x3A
- - \_\_le16
- - bg\_inode\_bitmap\_csum\_hi
+ - __le16
+ - bg_inode_bitmap_csum_hi
- Upper 16-bits of the inode bitmap checksum.
* - 0x3C
- - \_\_u32
- - bg\_reserved
+ - __u32
+ - bg_reserved
- Padding to 64 bytes.
.. _bgflags:
@@ -166,8 +166,8 @@ Block group flags can be any combination of the following:
* - Value
- Description
* - 0x1
- - inode table and bitmap are not initialized (EXT4\_BG\_INODE\_UNINIT).
+ - inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT).
* - 0x2
- - block bitmap is not initialized (EXT4\_BG\_BLOCK\_UNINIT).
+ - block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT).
* - 0x4
- - inode table is zeroed (EXT4\_BG\_INODE\_ZEROED).
+ - inode table is zeroed (EXT4_BG_INODE_ZEROED).
diff --git a/Documentation/filesystems/ext4/ifork.rst b/Documentation/filesystems/ext4/ifork.rst
index b9816d5a896b..dc31f505e6c8 100644
--- a/Documentation/filesystems/ext4/ifork.rst
+++ b/Documentation/filesystems/ext4/ifork.rst
@@ -1,6 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
-The Contents of inode.i\_block
+The Contents of inode.i_block
------------------------------
Depending on the type of file an inode describes, the 60 bytes of
@@ -47,7 +47,7 @@ In ext4, the file to logical block map has been replaced with an extent
tree. Under the old scheme, allocating a contiguous run of 1,000 blocks
requires an indirect block to map all 1,000 entries; with extents, the
mapping is reduced to a single ``struct ext4_extent`` with
-``ee_len = 1000``. If flex\_bg is enabled, it is possible to allocate
+``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate
very large files with a single extent, at a considerable reduction in
metadata block use, and some improvement in disk efficiency. The inode
must have the extents flag (0x80000) set for this feature to be in
@@ -76,28 +76,28 @@ which is 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le16
- - eh\_magic
+ - __le16
+ - eh_magic
- Magic number, 0xF30A.
* - 0x2
- - \_\_le16
- - eh\_entries
+ - __le16
+ - eh_entries
- Number of valid entries following the header.
* - 0x4
- - \_\_le16
- - eh\_max
+ - __le16
+ - eh_max
- Maximum number of entries that could follow the header.
* - 0x6
- - \_\_le16
- - eh\_depth
+ - __le16
+ - eh_depth
- Depth of this extent node in the extent tree. 0 = this extent node
points to data blocks; otherwise, this extent node points to other
extent nodes. The extent tree can be at most 5 levels deep: a logical
block number can be at most ``2^32``, and the smallest ``n`` that
satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5 (a sketch of this
calculation follows the table).
* - 0x8
- - \_\_le32
- - eh\_generation
+ - __le32
+ - eh_generation
- Generation of the tree. (Used by Lustre, but not standard ext4).
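
The ``eh_depth`` bound above can be checked directly: the factor of 4 in
the inequality is the number of 12-byte entries that fit in the 60-byte
``i_block`` root after its 12-byte header, and each interior block holds
``(blocksize - 12)/12`` entries. A sketch of the calculation:

.. code-block:: c

    #include <stdio.h>

    /* Smallest depth at which a tree rooted in i_block (4 entries)
     * can address 2^32 logical blocks, for a few block sizes. */
    int main(void)
    {
            for (unsigned long bs = 1024; bs <= 65536; bs *= 4) {
                    unsigned long long reach = 4;
                    int depth = 0;

                    while (reach < (1ULL << 32)) {
                            reach *= (bs - 12) / 12;
                            depth++;
                    }
                    printf("blocksize %5lu -> max depth %d\n", bs, depth);
            }
            return 0;
    }
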
Internal nodes of the extent tree, also known as index nodes, are
@@ -112,22 +112,22 @@ recorded as ``struct ext4_extent_idx``, and are 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le32
- - ei\_block
+ - __le32
+ - ei_block
- This index node covers file blocks from 'block' onward.
* - 0x4
- - \_\_le32
- - ei\_leaf\_lo
+ - __le32
+ - ei_leaf_lo
- Lower 32-bits of the block number of the extent node that is the next
level lower in the tree. The tree node pointed to can be either another
internal node or a leaf node, described below.
* - 0x8
- - \_\_le16
- - ei\_leaf\_hi
+ - __le16
+ - ei_leaf_hi
- Upper 16-bits of the previous field.
* - 0xA
- - \_\_u16
- - ei\_unused
+ - __u16
+ - ei_unused
-
Leaf nodes of the extent tree are recorded as ``struct ext4_extent``,
@@ -142,24 +142,24 @@ and are also 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le32
- - ee\_block
+ - __le32
+ - ee_block
- First file block number that this extent covers.
* - 0x4
- - \_\_le16
- - ee\_len
+ - __le16
+ - ee_len
- Number of blocks covered by extent. If the value of this field is <=
32768, the extent is initialized. If the value of the field is > 32768,
the extent is uninitialized and the actual extent length is ``ee_len`` -
32768. Therefore, the maximum length of an initialized extent is 32768
blocks, and the maximum length of an uninitialized extent is 32767 (a
decoding sketch follows this table).
* - 0x6
- - \_\_le16
- - ee\_start\_hi
+ - __le16
+ - ee_start_hi
- Upper 16-bits of the block number to which this extent points.
* - 0x8
- - \_\_le32
- - ee\_start\_lo
+ - __le32
+ - ee_start_lo
- Lower 32-bits of the block number to which this extent points.
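
A decoding sketch for the leaf fields above, assuming the values have
already been converted from little-endian (the helpers are illustrative,
not ext4's internal API):

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* ee_len > 32768 marks an uninitialized extent; the real length
     * is then ee_len - 32768. */
    static bool extent_uninitialized(uint16_t ee_len)
    {
            return ee_len > 32768;
    }

    static uint16_t extent_length(uint16_t ee_len)
    {
            return extent_uninitialized(ee_len) ? ee_len - 32768 : ee_len;
    }

    /* Physical start block, recombined from ee_start_hi/ee_start_lo. */
    static uint64_t extent_start(uint16_t ee_start_hi, uint32_t ee_start_lo)
    {
            return ((uint64_t)ee_start_hi << 32) | ee_start_lo;
    }
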
Prior to the introduction of metadata checksums, the extent header +
@@ -182,8 +182,8 @@ including) the checksum itself.
- Name
- Description
* - 0x0
- - \_\_le32
- - eb\_checksum
+ - __le32
+ - eb_checksum
- Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock)
Inline Data
diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst
index d1075178ce0b..a728af0d2fd0 100644
--- a/Documentation/filesystems/ext4/inlinedata.rst
+++ b/Documentation/filesystems/ext4/inlinedata.rst
@@ -11,12 +11,12 @@ file is smaller than 60 bytes, then the data are stored inline in
attribute space, then it might be found as an extended attribute
“system.data” within the inode body (“ibody EA”). This of course
constrains the number of extended attributes one can attach to an inode.
-If the data size increases beyond i\_block + ibody EA, a regular block
+If the data size increases beyond i_block + ibody EA, a regular block
is allocated and the contents moved to that block.
Pending a change to compact the extended attribute key used to store
inline data, one ought to be able to store 160 bytes of data in a
-256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to
+256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to
that, the limit was 156 bytes due to inefficient use of inode space.
The inline data feature requires the presence of an extended attribute
@@ -25,12 +25,12 @@ for “system.data”, even if the attribute value is zero length.
Inline Directories
~~~~~~~~~~~~~~~~~~
-The first four bytes of i\_block are the inode number of the parent
+The first four bytes of i_block are the inode number of the parent
directory. Following that is a 56-byte space for an array of directory
entries; see ``struct ext4_dir_entry``. If there is a “system.data”
attribute in the inode body, the EA value is an array of
``struct ext4_dir_entry`` as well. Note that for inline directories, the
-i\_block and EA space are treated as separate dirent blocks; directory
+i_block and EA space are treated as separate dirent blocks; directory
entries cannot span the two.
Inline directory entries are not checksummed, as the inode checksum
diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst
index a65baffb4ebf..cfc6c1659931 100644
--- a/Documentation/filesystems/ext4/inodes.rst
+++ b/Documentation/filesystems/ext4/inodes.rst
@@ -38,138 +38,138 @@ The inode table entry is laid out in ``struct ext4_inode``.
- Name
- Description
* - 0x0
- - \_\_le16
- - i\_mode
+ - __le16
+ - i_mode
- File mode. See the table i_mode_ below.
* - 0x2
- - \_\_le16
- - i\_uid
+ - __le16
+ - i_uid
- Lower 16-bits of Owner UID.
* - 0x4
- - \_\_le32
- - i\_size\_lo
+ - __le32
+ - i_size_lo
- Lower 32-bits of size in bytes.
* - 0x8
- - \_\_le32
- - i\_atime
- - Last access time, in seconds since the epoch. However, if the EA\_INODE
+ - __le32
+ - i_atime
+ - Last access time, in seconds since the epoch. However, if the EA_INODE
inode flag is set, this inode stores an extended attribute value and
this field contains the checksum of the value.
* - 0xC
- - \_\_le32
- - i\_ctime
+ - __le32
+ - i_ctime
- Last inode change time, in seconds since the epoch. However, if the
- EA\_INODE inode flag is set, this inode stores an extended attribute
+ EA_INODE inode flag is set, this inode stores an extended attribute
value and this field contains the lower 32 bits of the attribute value's
reference count.
* - 0x10
- - \_\_le32
- - i\_mtime
+ - __le32
+ - i_mtime
- Last data modification time, in seconds since the epoch. However, if the
- EA\_INODE inode flag is set, this inode stores an extended attribute
+ EA_INODE inode flag is set, this inode stores an extended attribute
value and this field contains the number of the inode that owns the
extended attribute.
* - 0x14
- - \_\_le32
- - i\_dtime
+ - __le32
+ - i_dtime
- Deletion Time, in seconds since the epoch.
* - 0x18
- - \_\_le16
- - i\_gid
+ - __le16
+ - i_gid
- Lower 16-bits of GID.
* - 0x1A
- - \_\_le16
- - i\_links\_count
+ - __le16
+ - i_links_count
- Hard link count. Normally, ext4 does not permit an inode to have more
than 65,000 hard links. This applies to files as well as directories,
which means that there cannot be more than 64,998 subdirectories in a
directory (each subdirectory's '..' entry counts as a hard link, as does
- the '.' entry in the directory itself). With the DIR\_NLINK feature
+ the '.' entry in the directory itself). With the DIR_NLINK feature
enabled, ext4 supports more than 64,998 subdirectories by setting this
field to 1 to indicate that the number of hard links is not known.
* - 0x1C
- - \_\_le32
- - i\_blocks\_lo
- - Lower 32-bits of “block” count. If the huge\_file feature flag is not
+ - __le32
+ - i_blocks_lo
+ - Lower 32-bits of “block” count. If the huge_file feature flag is not
set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
- on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
+ on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
- << 32)`` 512-byte blocks on disk. If huge\_file is set and
- EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file
+ << 32)`` 512-byte blocks on disk. If huge_file is set and
+ EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file
consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on
disk (a decoding sketch follows this table).
* - 0x20
- - \_\_le32
- - i\_flags
+ - __le32
+ - i_flags
- Inode flags. See the table i_flags_ below.
* - 0x24
- 4 bytes
- - i\_osd1
+ - i_osd1
- See the table i_osd1_ for more details.
* - 0x28
- 60 bytes
- - i\_block[EXT4\_N\_BLOCKS=15]
- - Block map or extent tree. See the section “The Contents of inode.i\_block”.
+ - i_block[EXT4_N_BLOCKS=15]
+ - Block map or extent tree. See the section “The Contents of inode.i_block”.
* - 0x64
- - \_\_le32
- - i\_generation
+ - __le32
+ - i_generation
- File version (for NFS).
* - 0x68
- - \_\_le32
- - i\_file\_acl\_lo
+ - __le32
+ - i_file_acl_lo
- Lower 32-bits of extended attribute block. ACLs are of course one of
many possible extended attributes; I think the name of this field is a
result of the first use of extended attributes being for ACLs.
* - 0x6C
- - \_\_le32
- - i\_size\_high / i\_dir\_acl
+ - __le32
+ - i_size_high / i_dir_acl
- Upper 32-bits of file/directory size. In ext2/3 this field was named
- i\_dir\_acl, though it was usually set to zero and never used.
+ i_dir_acl, though it was usually set to zero and never used.
* - 0x70
- - \_\_le32
- - i\_obso\_faddr
+ - __le32
+ - i_obso_faddr
- (Obsolete) fragment address.
* - 0x74
- 12 bytes
- - i\_osd2
+ - i_osd2
- See the table i_osd2_ for more details.
* - 0x80
- - \_\_le16
- - i\_extra\_isize
+ - __le16
+ - i_extra_isize
- Size of this inode - 128. Alternately, the size of the extended inode
fields beyond the original ext2 inode, including this field.
* - 0x82
- - \_\_le16
- - i\_checksum\_hi
+ - __le16
+ - i_checksum_hi
- Upper 16-bits of the inode checksum.
* - 0x84
- - \_\_le32
- - i\_ctime\_extra
+ - __le32
+ - i_ctime_extra
- Extra change time bits. This provides sub-second precision. See Inode
Timestamps section.
* - 0x88
- - \_\_le32
- - i\_mtime\_extra
+ - __le32
+ - i_mtime_extra
- Extra modification time bits. This provides sub-second precision.
* - 0x8C
- - \_\_le32
- - i\_atime\_extra
+ - __le32
+ - i_atime_extra
- Extra access time bits. This provides sub-second precision.
* - 0x90
- - \_\_le32
- - i\_crtime
+ - __le32
+ - i_crtime
- File creation time, in seconds since the epoch.
* - 0x94
- - \_\_le32
- - i\_crtime\_extra
+ - __le32
+ - i_crtime_extra
- Extra file creation time bits. This provides sub-second precision.
* - 0x98
- - \_\_le32
- - i\_version\_hi
+ - __le32
+ - i_version_hi
- Upper 32-bits for version number.
* - 0x9C
- - \_\_le32
- - i\_projid
+ - __le32
+ - i_projid
- Project ID.
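
A decoding sketch for the ``i_blocks_lo`` rules quoted above (on Linux
the upper 16 bits come from ``l_i_blocks_high`` in ``i_osd2``; the
helper itself is illustrative):

.. code-block:: c

    #include <stdint.h>

    #define EXT4_HUGE_FILE_FL 0x40000U   /* from the i_flags table below */

    /* Space consumed by the file. The result is in 512-byte units
     * unless both huge_file and EXT4_HUGE_FILE_FL are set, in which
     * case the stored count is in filesystem blocks. */
    static uint64_t inode_sectors(uint32_t i_blocks_lo, uint16_t blocks_hi,
                                  uint32_t i_flags, int sb_huge_file,
                                  uint32_t fs_block_size)
    {
            uint64_t n = ((uint64_t)blocks_hi << 32) | i_blocks_lo;

            if (!sb_huge_file)
                    return i_blocks_lo;
            if (!(i_flags & EXT4_HUGE_FILE_FL))
                    return n;
            return n * (fs_block_size / 512);   /* blocks -> sectors */
    }
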
.. _i_mode:
@@ -183,45 +183,45 @@ The ``i_mode`` value is a combination of the following flags:
* - Value
- Description
* - 0x1
- - S\_IXOTH (Others may execute)
+ - S_IXOTH (Others may execute)
* - 0x2
- - S\_IWOTH (Others may write)
+ - S_IWOTH (Others may write)
* - 0x4
- - S\_IROTH (Others may read)
+ - S_IROTH (Others may read)
* - 0x8
- - S\_IXGRP (Group members may execute)
+ - S_IXGRP (Group members may execute)
* - 0x10
- - S\_IWGRP (Group members may write)
+ - S_IWGRP (Group members may write)
* - 0x20
- - S\_IRGRP (Group members may read)
+ - S_IRGRP (Group members may read)
* - 0x40
- - S\_IXUSR (Owner may execute)
+ - S_IXUSR (Owner may execute)
* - 0x80
- - S\_IWUSR (Owner may write)
+ - S_IWUSR (Owner may write)
* - 0x100
- - S\_IRUSR (Owner may read)
+ - S_IRUSR (Owner may read)
* - 0x200
- - S\_ISVTX (Sticky bit)
+ - S_ISVTX (Sticky bit)
* - 0x400
- - S\_ISGID (Set GID)
+ - S_ISGID (Set GID)
* - 0x800
- - S\_ISUID (Set UID)
+ - S_ISUID (Set UID)
* -
- These are mutually-exclusive file types:
* - 0x1000
- - S\_IFIFO (FIFO)
+ - S_IFIFO (FIFO)
* - 0x2000
- - S\_IFCHR (Character device)
+ - S_IFCHR (Character device)
* - 0x4000
- - S\_IFDIR (Directory)
+ - S_IFDIR (Directory)
* - 0x6000
- - S\_IFBLK (Block device)
+ - S_IFBLK (Block device)
* - 0x8000
- - S\_IFREG (Regular file)
+ - S_IFREG (Regular file)
* - 0xA000
- - S\_IFLNK (Symbolic link)
+ - S_IFLNK (Symbolic link)
* - 0xC000
- - S\_IFSOCK (Socket)
+ - S_IFSOCK (Socket)
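
The file type lives in the top four bits of ``i_mode`` and the types are
mutually exclusive, so they decode with a single mask; a sketch:

.. code-block:: c

    #include <stdint.h>

    static const char *i_mode_type(uint16_t i_mode)
    {
            switch (i_mode & 0xF000) {
            case 0x1000: return "FIFO";
            case 0x2000: return "character device";
            case 0x4000: return "directory";
            case 0x6000: return "block device";
            case 0x8000: return "regular file";
            case 0xA000: return "symbolic link";
            case 0xC000: return "socket";
            default:     return "unknown";
            }
    }
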
.. _i_flags:
@@ -234,56 +234,56 @@ The ``i_flags`` field is a combination of these values:
* - Value
- Description
* - 0x1
- - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
+ - This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
* - 0x2
- This file should be preserved, should undeletion be desired
- (EXT4\_UNRM\_FL). (not implemented)
+ (EXT4_UNRM_FL). (not implemented)
* - 0x4
- - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
+ - File is compressed (EXT4_COMPR_FL). (not really implemented)
* - 0x8
- - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
+ - All writes to the file must be synchronous (EXT4_SYNC_FL).
* - 0x10
- - File is immutable (EXT4\_IMMUTABLE\_FL).
+ - File is immutable (EXT4_IMMUTABLE_FL).
* - 0x20
- - File can only be appended (EXT4\_APPEND\_FL).
+ - File can only be appended (EXT4_APPEND_FL).
* - 0x40
- - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
+ - The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
* - 0x80
- - Do not update access time (EXT4\_NOATIME\_FL).
+ - Do not update access time (EXT4_NOATIME_FL).
* - 0x100
- - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
+ - Dirty compressed file (EXT4_DIRTY_FL). (not used)
* - 0x200
- - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
+ - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
* - 0x400
- - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
+ - Do not compress file (EXT4_NOCOMPR_FL). (not used)
* - 0x800
- - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
- EXT4\_ECOMPR\_FL (compression error), which was never used.
+ - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
+ EXT4_ECOMPR_FL (compression error), which was never used.
* - 0x1000
- - Directory has hashed indexes (EXT4\_INDEX\_FL).
+ - Directory has hashed indexes (EXT4_INDEX_FL).
* - 0x2000
- - AFS magic directory (EXT4\_IMAGIC\_FL).
+ - AFS magic directory (EXT4_IMAGIC_FL).
* - 0x4000
- File data must always be written through the journal
- (EXT4\_JOURNAL\_DATA\_FL).
+ (EXT4_JOURNAL_DATA_FL).
* - 0x8000
- - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
+ - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
* - 0x10000
- All directory entry data should be written synchronously (see
- ``dirsync``) (EXT4\_DIRSYNC\_FL).
+ ``dirsync``) (EXT4_DIRSYNC_FL).
* - 0x20000
- - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
+ - Top of directory hierarchy (EXT4_TOPDIR_FL).
* - 0x40000
- - This is a huge file (EXT4\_HUGE\_FILE\_FL).
+ - This is a huge file (EXT4_HUGE_FILE_FL).
* - 0x80000
- - Inode uses extents (EXT4\_EXTENTS\_FL).
+ - Inode uses extents (EXT4_EXTENTS_FL).
* - 0x100000
- - Verity protected file (EXT4\_VERITY\_FL).
+ - Verity protected file (EXT4_VERITY_FL).
* - 0x200000
- Inode stores a large extended attribute value in its data blocks
- (EXT4\_EA\_INODE\_FL).
+ (EXT4_EA_INODE_FL).
* - 0x400000
- - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
+ - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
(deprecated)
* - 0x01000000
- Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
@@ -294,21 +294,21 @@ The ``i_flags`` field is a combination of these values:
- Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
mainline)
* - 0x10000000
- - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
+ - Inode has inline data (EXT4_INLINE_DATA_FL).
* - 0x20000000
- - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
+ - Create children with the same project ID (EXT4_PROJINHERIT_FL).
* - 0x80000000
- - Reserved for ext4 library (EXT4\_RESERVED\_FL).
+ - Reserved for ext4 library (EXT4_RESERVED_FL).
* -
- Aggregate flags:
* - 0x705BDFFF
- User-visible flags.
* - 0x604BC0FF
- - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
- EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
- EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
+ - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
+ EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
+ EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of
these flags in a special manner and they are masked out of the set of
- flags that are saved directly to i\_flags.
+ flags that are saved directly to i_flags.
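
A sketch of how such a mask is applied when a user changes flags: bits
inside the mask are taken from the request, everything else is
preserved. The value below is the documented aggregate from the table
above; as noted, the kernel's actual EXT4_FL_USER_MODIFIABLE differs in
the two special-cased flags:

.. code-block:: c

    #include <stdint.h>

    #define DOC_USER_MODIFIABLE 0x604BC0FFU   /* aggregate from above */

    static uint32_t apply_user_flags(uint32_t i_flags, uint32_t requested)
    {
            return (i_flags & ~DOC_USER_MODIFIABLE) |
                   (requested & DOC_USER_MODIFIABLE);
    }
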
.. _i_osd1:
@@ -325,9 +325,9 @@ Linux:
- Name
- Description
* - 0x0
- - \_\_le32
- - l\_i\_version
- - Inode version. However, if the EA\_INODE inode flag is set, this inode
+ - __le32
+ - l_i_version
+ - Inode version. However, if the EA_INODE inode flag is set, this inode
stores an extended attribute value and this field contains the upper 32
bits of the attribute value's reference count.
@@ -342,8 +342,8 @@ Hurd:
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_i\_translator
+ - __le32
+ - h_i_translator
- ??
Masix:
@@ -357,8 +357,8 @@ Masix:
- Name
- Description
* - 0x0
- - \_\_le32
- - m\_i\_reserved
+ - __le32
+ - m_i_reserved
- ??
.. _i_osd2:
@@ -376,30 +376,30 @@ Linux:
- Name
- Description
* - 0x0
- - \_\_le16
- - l\_i\_blocks\_high
+ - __le16
+ - l_i_blocks_high
- Upper 16-bits of the block count. Please see the note attached to
- i\_blocks\_lo.
+ i_blocks_lo.
* - 0x2
- - \_\_le16
- - l\_i\_file\_acl\_high
+ - __le16
+ - l_i_file_acl_high
- Upper 16-bits of the extended attribute block (historically, the file
ACL location). See the Extended Attributes section below.
* - 0x4
- - \_\_le16
- - l\_i\_uid\_high
+ - __le16
+ - l_i_uid_high
- Upper 16-bits of the Owner UID.
* - 0x6
- - \_\_le16
- - l\_i\_gid\_high
+ - __le16
+ - l_i_gid_high
- Upper 16-bits of the GID.
* - 0x8
- - \_\_le16
- - l\_i\_checksum\_lo
+ - __le16
+ - l_i_checksum_lo
- Lower 16-bits of the inode checksum.
* - 0xA
- - \_\_le16
- - l\_i\_reserved
+ - __le16
+ - l_i_reserved
- Unused.
Hurd:
@@ -413,24 +413,24 @@ Hurd:
- Name
- Description
* - 0x0
- - \_\_le16
- - h\_i\_reserved1
+ - __le16
+ - h_i_reserved1
- ??
* - 0x2
- - \_\_u16
- - h\_i\_mode\_high
+ - __u16
+ - h_i_mode_high
- Upper 16-bits of the file mode.
* - 0x4
- - \_\_le16
- - h\_i\_uid\_high
+ - __le16
+ - h_i_uid_high
- Upper 16-bits of the Owner UID.
* - 0x6
- - \_\_le16
- - h\_i\_gid\_high
+ - __le16
+ - h_i_gid_high
- Upper 16-bits of the GID.
* - 0x8
- - \_\_u32
- - h\_i\_author
+ - __u32
+ - h_i_author
- Author code?
Masix:
@@ -444,17 +444,17 @@ Masix:
- Name
- Description
* - 0x0
- - \_\_le16
- - h\_i\_reserved1
+ - __le16
+ - h_i_reserved1
- ??
* - 0x2
- - \_\_u16
- - m\_i\_file\_acl\_high
+ - __u16
+ - m_i_file_acl_high
- Upper 16-bits of the extended attribute block (historically, the file
ACL location).
* - 0x4
- - \_\_u32
- - m\_i\_reserved2[2]
+ - __u32
+ - m_i_reserved2[2]
- ??
Inode Size
@@ -466,11 +466,11 @@ In ext2 and ext3, the inode structure size was fixed at 128 bytes
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
-number of bytes actually used by struct ext4\_inode beyond the original
+number of bytes actually used by struct ext4_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
-inode, which allows struct ext4\_inode to grow for a new kernel without
+inode, which allows struct ext4_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
-EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
+EXT2_GOOD_OLD_INODE_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
@@ -498,11 +498,11 @@ structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
-January 2038. For inodes that are not linked from any directory but are
-still open (orphan inodes), the dtime field is overloaded for use with
-the orphan list. The superblock field ``s_last_orphan`` points to the
-first inode in the orphan list; dtime is then the number of the next
-orphaned inode, or zero if there are no more orphans.
+January 2038. If the filesystem does not have the orphan_file feature, inodes
+that are not linked from any directory but are still open (orphan inodes) have
+the dtime field overloaded for use with the orphan list. The superblock field
+``s_last_orphan`` points to the first inode in the orphan list; dtime is then
+the number of the next orphaned inode, or zero if there are no more orphans.
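+
+A sketch of walking this legacy orphan list (``read_inode()`` and
+``inode_dtime()`` are hypothetical helpers, not kernel API):
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    struct ext4_inode;                  /* on-disk inode, as above */
+    struct fs_ctx;                      /* hypothetical filesystem handle */
+
+    struct ext4_inode *read_inode(struct fs_ctx *fs, uint32_t ino);
+    uint32_t inode_dtime(const struct ext4_inode *inode);
+
+    /* s_last_orphan is the list head; each orphan's dtime holds the
+     * next orphan's inode number, and 0 terminates the list. */
+    static void walk_orphans(struct fs_ctx *fs, uint32_t s_last_orphan)
+    {
+            uint32_t ino = s_last_orphan;
+
+            while (ino != 0) {
+                    struct ext4_inode *inode = read_inode(fs, ino);
+
+                    /* ... truncate or delete the orphan here ... */
+                    ino = inode_dtime(inode);
+            }
+    }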
If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_inode_extra`` field is large enough to encompass the
@@ -516,7 +516,7 @@ creation time (crtime); this field is 64-bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible
through the regular stat() interface, though debugfs will report them.
-We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
+We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:
.. list-table::
@@ -525,8 +525,8 @@ In other words:
* - Extra epoch bits
- MSB of 32-bit time
- - Adjustment for signed 32-bit to 64-bit tv\_sec
- - Decoded 64-bit tv\_sec
+ - Adjustment for signed 32-bit to 64-bit tv_sec
+ - Decoded 64-bit tv_sec
- valid time range
* - 0 0
- 1
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..6e8fb2d4b46f 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -4,14 +4,14 @@ Journal (jbd2)
--------------
Introduced in ext3, the ext4 filesystem employs a journal to protect the
-filesystem against corruption in the case of a system crash. A small
-continuous region of disk (default 128MiB) is reserved inside the
-filesystem as a place to land “important” data writes on-disk as quickly
-as possible. Once the important data transaction is fully written to the
-disk and flushed from the disk write cache, a record of the data being
-committed is also written to the journal. At some later point in time,
-the journal code writes the transactions to their final locations on
-disk (this could involve a lot of seeking or a lot of small
+filesystem against metadata inconsistencies in the case of a system crash. Up
+to 10,240,000 file system blocks (see mke2fs(8) for more details on journal
+size limits) can be reserved inside the filesystem as a place to land
+“important” data writes on-disk as quickly as possible. Once the important
+data transaction is fully written to the disk and flushed from the disk write
+cache, a record of the data being committed is also written to the journal. At
+some later point in time, the journal code writes the transactions to their
+final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.
+In case of ``data=ordered`` mode, Ext4 also supports fast commits, which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in a fast commit space that is shared with JBD2.
+Once the fast commit area fills up, or if a fast commit is not possible,
+or if the JBD2 commit timer goes off, Ext4 performs a traditional full
+commit. A full commit invalidates all the fast commits that happened
+before it, making the fast commit area empty for further fast commits.
+This feature needs to be enabled at mkfs time.
+
The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
@@ -52,8 +63,8 @@ Generally speaking, the journal has this format:
:header-rows: 1
* - Superblock
- - descriptor\_block (data\_blocks or revocation\_block) [more data or
- revocations] commmit\_block
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commit_block
- [more transactions...]
* -
- One transaction
@@ -82,8 +93,8 @@ superblock.
* - 1024 bytes of padding
- ext4 Superblock
- Journal Superblock
- - descriptor\_block (data\_blocks or revocation\_block) [more data or
- revocations] commmit\_block
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commit_block
- [more transactions...]
* -
-
@@ -106,17 +117,17 @@ Every block in the journal starts with a common 12-byte header
- Name
- Description
* - 0x0
- - \_\_be32
- - h\_magic
+ - __be32
+ - h_magic
- jbd2 magic number, 0xC03B3998.
* - 0x4
- - \_\_be32
- - h\_blocktype
+ - __be32
+ - h_blocktype
- Description of what this block contains. See the jbd2_blocktype_ table
below.
* - 0x8
- - \_\_be32
- - h\_sequence
+ - __be32
+ - h_sequence
- The transaction ID that goes with this block.
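
A host-side sketch mirroring this header (all fields are big-endian on
disk, hence the byte swap; ``be32toh()`` comes from glibc's <endian.h>):

.. code-block:: c

    #include <endian.h>
    #include <stdint.h>

    struct jbd2_header {            /* the 12-byte common header */
            uint32_t h_magic;       /* 0xC03B3998 on disk */
            uint32_t h_blocktype;
            uint32_t h_sequence;
    };

    static int looks_like_jbd2(const struct jbd2_header *h)
    {
            return be32toh(h->h_magic) == 0xC03B3998U;
    }
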
.. _jbd2_blocktype:
@@ -166,95 +177,104 @@ which is 1024 bytes long:
-
- Static information describing the journal.
* - 0x0
- - journal\_header\_t (12 bytes)
- - s\_header
+ - journal_header_t (12 bytes)
+ - s_header
- Common header identifying this as a superblock.
* - 0xC
- - \_\_be32
- - s\_blocksize
+ - __be32
+ - s_blocksize
- Journal device block size.
* - 0x10
- - \_\_be32
- - s\_maxlen
+ - __be32
+ - s_maxlen
- Total number of blocks in this journal.
* - 0x14
- - \_\_be32
- - s\_first
+ - __be32
+ - s_first
- First block of log information.
* -
-
-
- Dynamic information describing the current state of the log.
* - 0x18
- - \_\_be32
- - s\_sequence
+ - __be32
+ - s_sequence
- First commit ID expected in log.
* - 0x1C
- - \_\_be32
- - s\_start
+ - __be32
+ - s_start
- Block number of the start of log. Contrary to the comments, this field
being zero does not imply that the journal is clean!
* - 0x20
- - \_\_be32
- - s\_errno
- - Error value, as set by jbd2\_journal\_abort().
+ - __be32
+ - s_errno
+ - Error value, as set by jbd2_journal_abort().
* -
-
-
- The remaining fields are only valid in a v2 superblock.
* - 0x24
- - \_\_be32
- - s\_feature\_compat;
+ - __be32
+ - s_feature_compat
- Compatible feature set. See the table jbd2_compat_ below.
* - 0x28
- - \_\_be32
- - s\_feature\_incompat
+ - __be32
+ - s_feature_incompat
- Incompatible feature set. See the table jbd2_incompat_ below.
* - 0x2C
- - \_\_be32
- - s\_feature\_ro\_compat
+ - __be32
+ - s_feature_ro_compat
- Read-only compatible feature set. There aren't any of these currently.
* - 0x30
- - \_\_u8
- - s\_uuid[16]
+ - __u8
+ - s_uuid[16]
- 128-bit uuid for journal. This is compared against the copy in the ext4
super block at mount time.
* - 0x40
- - \_\_be32
- - s\_nr\_users
+ - __be32
+ - s_nr_users
- Number of file systems sharing this journal.
* - 0x44
- - \_\_be32
- - s\_dynsuper
+ - __be32
+ - s_dynsuper
- Location of dynamic super block copy. (Not used?)
* - 0x48
- - \_\_be32
- - s\_max\_transaction
+ - __be32
+ - s_max_transaction
- Limit of journal blocks per transaction. (Not used?)
* - 0x4C
- - \_\_be32
- - s\_max\_trans\_data
+ - __be32
+ - s_max_trans_data
- Limit of data blocks per transaction. (Not used?)
* - 0x50
- - \_\_u8
- - s\_checksum\_type
+ - __u8
+ - s_checksum_type
- Checksum algorithm used for the journal. See jbd2_checksum_type_ for
more info.
* - 0x51
- - \_\_u8[3]
- - s\_padding2
+ - __u8[3]
+ - s_padding2
-
* - 0x54
- - \_\_u32
- - s\_padding[42]
+ - __be32
+ - s_num_fc_blocks
+ - Number of fast commit blocks in the journal.
+ * - 0x58
+ - __be32
+ - s_head
+ - Block number of the head (first unused block) of the journal, only
+ up-to-date when the journal is empty.
+ * - 0x5C
+ - __u32
+ - s_padding[40]
-
* - 0xFC
- - \_\_be32
- - s\_checksum
+ - __be32
+ - s_checksum
- Checksum of the entire superblock, with this field set to zero.
* - 0x100
- - \_\_u8
- - s\_users[16\*48]
+ - __u8
+ - s_users[16*48]
- ids of all file systems sharing the log. e2fsprogs/Linux don't allow
shared external journals, but I imagine Lustre (or ocfs2?), which use
the jbd2 code, might.
@@ -271,7 +291,7 @@ The journal compat features are any combination of the following:
- Description
* - 0x1
- Journal maintains checksums on the data blocks.
- (JBD2\_FEATURE\_COMPAT\_CHECKSUM)
+ (JBD2_FEATURE_COMPAT_CHECKSUM)
.. _jbd2_incompat:
@@ -284,21 +304,23 @@ The journal incompat features are any combination of the following:
* - Value
- Description
* - 0x1
- - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
+ - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
* - 0x2
- Journal can deal with 64-bit block numbers.
- (JBD2\_FEATURE\_INCOMPAT\_64BIT)
+ (JBD2_FEATURE_INCOMPAT_64BIT)
* - 0x4
- - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
+ - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
* - 0x8
- This journal uses v2 of the checksum on-disk format. Each journal
metadata block gets its own checksum, and the block tags in the
descriptor table contain checksums for each of the data blocks in the
- journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
+ journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
* - 0x10
- This journal uses v3 of the checksum on-disk format. This is the same as
v2, but the journal block tag size is fixed regardless of the size of
- block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
+ block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
+ * - 0x20
+ - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
.. _jbd2_checksum_type:
@@ -338,11 +360,11 @@ Descriptor blocks consume at least 36 bytes, but use a full block:
- Name
- Descriptor
* - 0x0
- - journal\_header\_t
+ - journal_header_t
- (open coded)
- Common block header.
* - 0xC
- - struct journal\_block\_tag\_s
+ - struct journal_block_tag_s
- open coded array[]
- Enough tags either to fill up the block or to describe all the data
blocks that follow this descriptor block.
@@ -350,7 +372,7 @@ Descriptor blocks consume at least 36 bytes, but use a full block:
Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.
@@ -363,24 +385,24 @@ following. The size is 16 or 32 bytes.
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_blocknr
+ - __be32
+ - t_blocknr
- Lower 32-bits of the location of where the corresponding data block
should end up on disk.
* - 0x4
- - \_\_be32
- - t\_flags
+ - __be32
+ - t_flags
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
more info.
* - 0x8
- - \_\_be32
- - t\_blocknr\_high
+ - __be32
+ - t_blocknr_high
- Upper 32-bits of the location of where the corresponding data block
- should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
+ should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
not enabled.
* - 0xC
- - \_\_be32
- - t\_checksum
+ - __be32
+ - t_checksum
- Checksum of the journal UUID, the sequence number, and the data block.
* -
-
@@ -416,7 +438,7 @@ The journal tag flags are any combination of the following:
* - 0x8
- This is the last tag in this descriptor block.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:
@@ -429,18 +451,18 @@ following. The size is 8, 12, 24, or 28 bytes:
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_blocknr
+ - __be32
+ - t_blocknr
- Lower 32-bits of the location of where the corresponding data block
should end up on disk.
* - 0x4
- - \_\_be16
- - t\_checksum
+ - __be16
+ - t_checksum
- Checksum of the journal UUID, the sequence number, and the data block.
Note that only the lower 16 bits are stored.
* - 0x6
- - \_\_be16
- - t\_flags
+ - __be16
+ - t_flags
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
more info.
* -
@@ -449,8 +471,8 @@ following. The size is 8, 12, 24, or 28 bytes:
- This next field is only present if the super block indicates support for
64-bit block numbers.
* - 0x8
- - \_\_be32
- - t\_blocknr\_high
+ - __be32
+ - t_blocknr_high
- Upper 32-bits of the location of where the corresponding data block
should end up on disk.
* -
@@ -466,8 +488,8 @@ following. The size is 8, 12, 24, or 28 bytes:
``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
field.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
-JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:
.. list-table::
@@ -479,8 +501,8 @@ JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_checksum
+ - __be32
+ - t_checksum
- Checksum of the journal UUID + the descriptor block, with this field set
to zero.
@@ -521,25 +543,25 @@ length, but use a full block:
- Name
- Description
* - 0x0
- - journal\_header\_t
- - r\_header
+ - journal_header_t
+ - r_header
- Common block header.
* - 0xC
- - \_\_be32
- - r\_count
+ - __be32
+ - r_count
- Number of bytes used in this block.
* - 0x10
- - \_\_be32 or \_\_be64
+ - __be32 or __be64
- blocks[0]
- Blocks to revoke.
-After r\_count is a linear array of block numbers that are effectively
+After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
-JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:
.. list-table::
@@ -551,8 +573,8 @@ block is a ``struct jbd2_journal_revoke_tail``, which has this format:
- Name
- Description
* - 0x0
- - \_\_be32
- - r\_checksum
+ - __be32
+ - r_checksum
- Checksum of the journal UUID + revocation block
Commit Block
@@ -575,37 +597,165 @@ bytes long (but uses a full block):
- Name
- Descriptor
* - 0x0
- - journal\_header\_s
+ - journal_header_s
- (open coded)
- Common block header.
* - 0xC
- unsigned char
- - h\_chksum\_type
+ - h_chksum_type
- The type of checksum to use to verify the integrity of the data blocks
in the transaction. See jbd2_checksum_type_ for more info.
* - 0xD
- unsigned char
- - h\_chksum\_size
+ - h_chksum_size
- The number of bytes used by the checksum. Most likely 4.
* - 0xE
- unsigned char
- - h\_padding[2]
+ - h_padding[2]
-
* - 0x10
- - \_\_be32
- - h\_chksum[JBD2\_CHECKSUM\_BYTES]
+ - __be32
+ - h_chksum[JBD2_CHECKSUM_BYTES]
- 32 bytes of space to store checksums. If
- JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
+ JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
are set, the first ``__be32`` is the checksum of the journal UUID and
the entire commit block, with this field zeroed. If
- JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
+ JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
crc32 of all the blocks already written to the transaction.
* - 0x30
- - \_\_be64
- - h\_commit\_sec
+ - __be64
+ - h_commit_sec
- The time that the transaction was committed, in seconds since the epoch.
* - 0x38
- - \_\_be32
- - h\_commit\_nsec
+ - __be32
+ - h_commit_nsec
- Nanoseconds component of the above timestamp.
+Fast commits
+~~~~~~~~~~~~
+
+The fast commit area is organized as a log of tag-length-value (TLV)
+entries. Each TLV has a ``struct ext4_fc_tl`` at the beginning, which
+stores the tag and the length of the entire field. It is followed by a
+variable-length, tag-specific value.
+Here is the list of supported tags and their meanings:
+
+.. list-table::
+ :widths: 8 20 20 32
+ :header-rows: 1
+
+ * - Tag
+ - Meaning
+ - Value struct
+ - Description
+ * - EXT4_FC_TAG_HEAD
+ - Fast commit area header
+ - ``struct ext4_fc_head``
+ - Stores the TID of the transaction after which these fast commits should
+ be applied.
+ * - EXT4_FC_TAG_ADD_RANGE
+ - Add extent to inode
+ - ``struct ext4_fc_add_range``
+ - Stores the inode number and extent to be added in this inode
+ * - EXT4_FC_TAG_DEL_RANGE
+ - Remove a range of logical offsets from an inode
+ - ``struct ext4_fc_del_range``
+ - Stores the inode number and the logical offset range that needs to be
+ removed
+ * - EXT4_FC_TAG_CREAT
+ - Create directory entry for a newly created file
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry of the
+ newly created file
+ * - EXT4_FC_TAG_LINK
+ - Link a directory entry to an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+ * - EXT4_FC_TAG_UNLINK
+ - Unlink a directory entry of an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+
+ * - EXT4_FC_TAG_PAD
+ - Padding (unused area)
+ - None
+ - Unused bytes in the fast commit area.
+
+ * - EXT4_FC_TAG_TAIL
+ - Mark the end of a fast commit
+ - ``struct ext4_fc_tail``
+ - Stores the TID of the commit and the CRC of the fast commit that this
+ tag terminates
+
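+A sketch of walking this TLV log, assuming ``struct ext4_fc_tl`` is a
+16-bit tag followed by a 16-bit length and that the length counts only
+the value bytes (check the kernel sources for the exact encoding):
+
+.. code-block:: c
+
+    #include <stdint.h>
+    #include <string.h>
+
+    struct fc_tl {                  /* illustrative mirror of ext4_fc_tl */
+            uint16_t tag;           /* EXT4_FC_TAG_* */
+            uint16_t len;           /* bytes of value that follow */
+    };
+
+    static void walk_fc_log(const uint8_t *log, size_t size)
+    {
+            size_t off = 0;
+
+            while (off + sizeof(struct fc_tl) <= size) {
+                    struct fc_tl tl;
+
+                    memcpy(&tl, log + off, sizeof(tl));
+                    off += sizeof(tl);
+                    /* tl.tag selects the value struct, e.g. struct
+                     * ext4_fc_add_range for EXT4_FC_TAG_ADD_RANGE. */
+                    off += tl.len;
+            }
+    }
+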
+Fast Commit Replay Idempotence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Fast commit tags are idempotent in nature, provided the recovery code follows
+certain rules. The guiding principle the commit path follows while
+committing is to store the result of a particular operation instead of
+storing the procedure.
+
+Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
+was associated with inode 10. During fast commit, instead of storing this
+operation as a procedure "rename a to b", we store the resulting file system
+state as a "series" of outcomes:
+
+- Link dirent b to inode 10
+- Unlink dirent a
+- Inode 10 with valid refcount
+
+Now when the recovery code runs, it needs to "enforce" this state on the file
+system. This is what guarantees idempotence of fast commit replay.
+
+Let's take an example of a procedure that is not idempotent and see how fast
+commits make it idempotent. Consider the following sequence of operations:
+
+1) rm A
+2) mv B A
+3) read A
+
+If we store this sequence of operations as-is, then the replay is not
+idempotent. Let's say that during replay, we crash after (2). During the
+second replay, file A (which was actually created as a result of the "mv B A"
+operation) would get deleted. Thus, a file named A would be absent when we
+try to read A. So, this sequence of operations is not idempotent. However, as
+mentioned above, instead of storing the procedure, fast commits store the
+outcome of each procedure. Thus the fast commit log for the above procedure
+would be as follows:
+
+(Let's assume dirent A was linked to inode 10 and dirent B was linked to
+inode 11 before the replay)
+
+1) Unlink A
+2) Link A to inode 11
+3) Unlink B
+4) Inode 11
+
+If we crash after (3), we will have file A linked to inode 11. During the
+second replay, we will remove file A (inode 11), but we will create it back
+and make it point to inode 11. We won't find B, so we'll just skip that step.
+At this point, the refcount for inode 11 is not reliable, but that gets fixed
+by the replay of the last inode 11 tag. Thus, by converting a non-idempotent
+procedure into a series of idempotent outcomes, fast commits ensure
+idempotence during replay.
+
+Journal Checkpoint
+~~~~~~~~~~~~~~~~~~
+
+Checkpointing the journal ensures all transactions and their associated buffers
+are submitted to the disk. In-progress transactions are waited upon and included
+in the checkpoint. Checkpointing is used internally during critical updates to
+the filesystem, including journal recovery, filesystem resizing, and freeing of
+the journal_t structure.
+
+A journal checkpoint can be triggered from userspace via the ioctl
+EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 flags argument.
+Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
+can be used to verify input to the ioctl. It returns an error if there is any
+invalid input, and otherwise returns success without performing
+any checkpointing. This can be used to check whether the ioctl exists on a
+system and to verify there are no issues with arguments or flags. The
+other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
+EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
+discarded or zero-filled, respectively, after the journal checkpoint is
+complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
+cannot both be set. The ioctl may be useful when snapshotting a system or for
+complying with content deletion SLOs.
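+
+A hedged usage sketch of the dry-run probe described above (macro and
+flag names as in the kernel's ext4 UAPI header; the flags value is
+passed by pointer, and the u64 width follows the text above):
+
+.. code-block:: c
+
+    #include <fcntl.h>
+    #include <linux/ext4.h>         /* EXT4_IOC_CHECKPOINT, flag bits */
+    #include <stdio.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+
+    int main(int argc, char **argv)
+    {
+            unsigned long long flags = EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN;
+            int fd = open(argc > 1 ? argv[1] : "/", O_RDONLY);
+
+            if (fd < 0 || ioctl(fd, EXT4_IOC_CHECKPOINT, &flags) != 0)
+                    perror("EXT4_IOC_CHECKPOINT dry run");
+            else
+                    puts("checkpoint ioctl supported");
+            if (fd >= 0)
+                    close(fd);
+            return 0;
+    }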
diff --git a/Documentation/filesystems/ext4/mmp.rst b/Documentation/filesystems/ext4/mmp.rst
index 25660981d93c..174dd6538737 100644
--- a/Documentation/filesystems/ext4/mmp.rst
+++ b/Documentation/filesystems/ext4/mmp.rst
@@ -7,8 +7,8 @@ Multiple mount protection (MMP) is a feature that protects the
filesystem against multiple hosts trying to use the filesystem
simultaneously. When a filesystem is opened (for mounting, or fsck,
etc.), the MMP code running on the node (call it node A) checks a
-sequence number. If the sequence number is EXT4\_MMP\_SEQ\_CLEAN, the
-open continues. If the sequence number is EXT4\_MMP\_SEQ\_FSCK, then
+sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the
+open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then
fsck is (hopefully) running, and open fails immediately. Otherwise, the
open code will wait for twice the specified MMP check interval and check
the sequence number again. If the sequence number has changed, then the
@@ -40,38 +40,38 @@ The MMP structure (``struct mmp_struct``) is as follows:
- Name
- Description
* - 0x0
- - \_\_le32
- - mmp\_magic
+ - __le32
+ - mmp_magic
- Magic number for MMP, 0x004D4D50 (“MMP”).
* - 0x4
- - \_\_le32
- - mmp\_seq
+ - __le32
+ - mmp_seq
- Sequence number, updated periodically.
* - 0x8
- - \_\_le64
- - mmp\_time
+ - __le64
+ - mmp_time
- Time that the MMP block was last updated.
* - 0x10
- char[64]
- - mmp\_nodename
+ - mmp_nodename
- Hostname of the node that opened the filesystem.
* - 0x50
- char[32]
- - mmp\_bdevname
+ - mmp_bdevname
- Block device name of the filesystem.
* - 0x70
- - \_\_le16
- - mmp\_check\_interval
+ - __le16
+ - mmp_check_interval
- The MMP re-check interval, in seconds.
* - 0x72
- - \_\_le16
- - mmp\_pad1
+ - __le16
+ - mmp_pad1
- Zero.
* - 0x74
- - \_\_le32[226]
- - mmp\_pad2
+ - __le32[226]
+ - mmp_pad2
- Zero.
* - 0x3FC
- - \_\_le32
- - mmp\_checksum
+ - __le32
+ - mmp_checksum
- Checksum of the MMP block.
diff --git a/Documentation/filesystems/ext4/orphan.rst b/Documentation/filesystems/ext4/orphan.rst
new file mode 100644
index 000000000000..03cca178864b
--- /dev/null
+++ b/Documentation/filesystems/ext4/orphan.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Orphan file
+-----------
+
+In Unix there can be inodes that are unlinked from the directory hierarchy but
+that are still alive because they are open. In case of a crash the filesystem
+has to clean up these inodes, as otherwise they (and the blocks referenced
+from them) would leak. Similarly, if we truncate or extend a file, we may not
+be able to perform the operation in a single journalling transaction. In such
+cases we track the inode as an orphan so that in case of a crash the extra
+blocks allocated to the file get truncated.
+
+Traditionally ext4 tracks orphan inodes in the form of a singly linked list,
+where the superblock contains the inode number of the last orphan inode (the
+s_last_orphan field) and each inode contains the inode number of the
+previously orphaned inode (the i_dtime inode field is overloaded for this).
+However, this filesystem-global singly linked list is a scalability bottleneck
+for workloads that result in heavy creation of orphan inodes. When the orphan
+file feature (COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special
+inode (referenced from the superblock through s_orphan_file_inum) with several
+blocks. Each of these blocks has the following structure:
+
+============= ================ =============== ===============================
+Offset Type Name Description
+============= ================ =============== ===============================
+0x0 Array of Orphan inode Each __le32 entry is either
+ __le32 entries entries empty (0) or it contains
+ inode number of an orphan
+ inode.
+blocksize-8 __le32 ob_magic Magic value stored in orphan
+ block tail (0x0b10ca04)
+blocksize-4 __le32 ob_checksum Checksum of the orphan block.
+============= ================ =============== ===============================
+
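+A sketch of scanning one orphan file block under this layout (a real
+parser must also verify ob_magic and ob_checksum; printing stands in for
+actual orphan processing):
+
+.. code-block:: c
+
+    #include <endian.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <string.h>
+
+    static void scan_orphan_block(const uint8_t *block, size_t blocksize)
+    {
+            size_t nr = (blocksize - 8) / 4;    /* entries before the tail */
+
+            for (size_t i = 0; i < nr; i++) {
+                    uint32_t ino;
+
+                    memcpy(&ino, block + 4 * i, sizeof(ino));
+                    ino = le32toh(ino);
+                    if (ino != 0)
+                            printf("orphan inode %u\n", ino);
+            }
+    }
+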
+When a filesystem with the orphan file feature is writeably mounted, we set
+the RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may
+be valid orphan entries. If we see this feature when mounting the
+filesystem, we read the whole orphan file and process all orphan inodes found
+there as usual. When cleanly unmounting the filesystem we remove the
+RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan
+file and also to keep the filesystem fully compatible with older kernels.
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 123ebfde47ee..0fad6eda6e15 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -7,7 +7,7 @@ An ext4 file system is split into a series of block groups. To reduce
performance difficulties due to fragmentation, the block allocator tries
very hard to keep each file's blocks within the same group, thereby
reducing seek times. The size of a block group is specified in
-``sb.s_blocks_per_group`` blocks, though it can also calculated as 8 \*
+``sb.s_blocks_per_group`` blocks, though it can also be calculated as 8 *
``block_size_in_bytes``. With the default block size of 4KiB, each group
will contain 32,768 blocks, for a length of 128MiB. The number of block
groups is the size of the device divided by the size of a block group.
diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
index 9061aabba827..fc0636901fa0 100644
--- a/Documentation/filesystems/ext4/special_inodes.rst
+++ b/Documentation/filesystems/ext4/special_inodes.rst
@@ -34,5 +34,22 @@ ext4 reserves some inode for special features, as follows:
* - 10
- Replica inode, used for some non-upstream feature?
* - 11
- - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock.
+ - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock.
+Note that there are also some inodes allocated from non-reserved inode numbers
+for other filesystem features which are not referenced from the standard
+directory hierarchy. These are generally referenced from the superblock. They
+are:
+
+.. list-table::
+ :widths: 20 50
+ :header-rows: 1
+
+ * - Superblock field
+ - Description
+
+ * - s_lpf_ino
+ - Inode number of lost+found directory.
+ * - s_prj_quota_inum
+ - Inode number of the quota file tracking project quotas.
+ * - s_orphan_file_inum
+ - Inode number of the file tracking orphan inodes.
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 93e55d7c1d40..a1eb4a11a1d0 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -7,7 +7,7 @@ The superblock records various information about the enclosing
filesystem, such as block counts, inode counts, supported features,
maintenance information, and more.
-If the sparse\_super feature flag is set, redundant copies of the
+If the sparse_super feature flag is set, redundant copies of the
superblock and group descriptors are kept only in the groups whose group
number is either 0 or a power of 3, 5, or 7. If the flag is not set,
redundant copies are kept in all groups.
@@ -27,107 +27,107 @@ The ext4 superblock is laid out as follows in
- Name
- Description
* - 0x0
- - \_\_le32
- - s\_inodes\_count
+ - __le32
+ - s_inodes_count
- Total inode count.
* - 0x4
- - \_\_le32
- - s\_blocks\_count\_lo
+ - __le32
+ - s_blocks_count_lo
- Total block count.
* - 0x8
- - \_\_le32
- - s\_r\_blocks\_count\_lo
+ - __le32
+ - s_r_blocks_count_lo
- This number of blocks can only be allocated by the super-user.
* - 0xC
- - \_\_le32
- - s\_free\_blocks\_count\_lo
+ - __le32
+ - s_free_blocks_count_lo
- Free block count.
* - 0x10
- - \_\_le32
- - s\_free\_inodes\_count
+ - __le32
+ - s_free_inodes_count
- Free inode count.
* - 0x14
- - \_\_le32
- - s\_first\_data\_block
+ - __le32
+ - s_first_data_block
- First data block. This must be at least 1 for 1k-block filesystems and
is typically 0 for all other block sizes.
* - 0x18
- - \_\_le32
- - s\_log\_block\_size
- - Block size is 2 ^ (10 + s\_log\_block\_size).
+ - __le32
+ - s_log_block_size
+ - Block size is 2 ^ (10 + s_log_block_size).
* - 0x1C
- - \_\_le32
- - s\_log\_cluster\_size
- - Cluster size is 2 ^ (10 + s\_log\_cluster\_size) blocks if bigalloc is
- enabled. Otherwise s\_log\_cluster\_size must equal s\_log\_block\_size.
+ - __le32
+ - s_log_cluster_size
+ - Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is
+ enabled. Otherwise s_log_cluster_size must equal s_log_block_size.
* - 0x20
- - \_\_le32
- - s\_blocks\_per\_group
+ - __le32
+ - s_blocks_per_group
- Blocks per group.
* - 0x24
- - \_\_le32
- - s\_clusters\_per\_group
+ - __le32
+ - s_clusters_per_group
- Clusters per group, if bigalloc is enabled. Otherwise
- s\_clusters\_per\_group must equal s\_blocks\_per\_group.
+ s_clusters_per_group must equal s_blocks_per_group.
* - 0x28
- - \_\_le32
- - s\_inodes\_per\_group
+ - __le32
+ - s_inodes_per_group
- Inodes per group.
* - 0x2C
- - \_\_le32
- - s\_mtime
+ - __le32
+ - s_mtime
- Mount time, in seconds since the epoch.
* - 0x30
- - \_\_le32
- - s\_wtime
+ - __le32
+ - s_wtime
- Write time, in seconds since the epoch.
* - 0x34
- - \_\_le16
- - s\_mnt\_count
+ - __le16
+ - s_mnt_count
- Number of mounts since the last fsck.
* - 0x36
- - \_\_le16
- - s\_max\_mnt\_count
+ - __le16
+ - s_max_mnt_count
- Number of mounts beyond which a fsck is needed.
* - 0x38
- - \_\_le16
- - s\_magic
+ - __le16
+ - s_magic
- Magic signature, 0xEF53
* - 0x3A
- - \_\_le16
- - s\_state
+ - __le16
+ - s_state
- File system state. See super_state_ for more info.
* - 0x3C
- - \_\_le16
- - s\_errors
+ - __le16
+ - s_errors
- Behaviour when detecting errors. See super_errors_ for more info.
* - 0x3E
- - \_\_le16
- - s\_minor\_rev\_level
+ - __le16
+ - s_minor_rev_level
- Minor revision level.
* - 0x40
- - \_\_le32
- - s\_lastcheck
+ - __le32
+ - s_lastcheck
- Time of last check, in seconds since the epoch.
* - 0x44
- - \_\_le32
- - s\_checkinterval
+ - __le32
+ - s_checkinterval
- Maximum time between checks, in seconds.
* - 0x48
- - \_\_le32
- - s\_creator\_os
+ - __le32
+ - s_creator_os
- Creator OS. See the table super_creator_ for more info.
* - 0x4C
- - \_\_le32
- - s\_rev\_level
+ - __le32
+ - s_rev_level
- Revision level. See the table super_revision_ for more info.
* - 0x50
- - \_\_le16
- - s\_def\_resuid
+ - __le16
+ - s_def_resuid
- Default uid for reserved blocks.
* - 0x52
- - \_\_le16
- - s\_def\_resgid
+ - __le16
+ - s_def_resgid
- Default gid for reserved blocks.
* -
-
@@ -143,50 +143,50 @@ The ext4 superblock is laid out as follows in
about a feature in either the compatible or incompatible feature set, it
must abort and not try to meddle with things it doesn't understand...
* - 0x54
- - \_\_le32
- - s\_first\_ino
+ - __le32
+ - s_first_ino
- First non-reserved inode.
* - 0x58
- - \_\_le16
- - s\_inode\_size
+ - __le16
+ - s_inode_size
- Size of inode structure, in bytes.
* - 0x5A
- - \_\_le16
- - s\_block\_group\_nr
+ - __le16
+ - s_block_group_nr
- Block group # of this superblock.
* - 0x5C
- - \_\_le32
- - s\_feature\_compat
+ - __le32
+ - s_feature_compat
- Compatible feature set flags. Kernel can still read/write this fs even
if it doesn't understand a flag; fsck should not do that. See the
super_compat_ table for more info.
* - 0x60
- - \_\_le32
- - s\_feature\_incompat
+ - __le32
+ - s_feature_incompat
- Incompatible feature set. If the kernel or fsck doesn't understand one
of these bits, it should stop. See the super_incompat_ table for more
info.
* - 0x64
- - \_\_le32
- - s\_feature\_ro\_compat
+ - __le32
+ - s_feature_ro_compat
- Readonly-compatible feature set. If the kernel doesn't understand one of
these bits, it can still mount read-only. See the super_rocompat_ table
for more info.
* - 0x68
- - \_\_u8
- - s\_uuid[16]
+ - __u8
+ - s_uuid[16]
- 128-bit UUID for volume.
* - 0x78
- char
- - s\_volume\_name[16]
+ - s_volume_name[16]
- Volume label.
* - 0x88
- char
- - s\_last\_mounted[64]
+ - s_last_mounted[64]
- Directory where filesystem was last mounted.
* - 0xC8
- - \_\_le32
- - s\_algorithm\_usage\_bitmap
+ - __le32
+ - s_algorithm_usage_bitmap
- For compression (Not used in e2fsprogs/Linux)
* -
-
@@ -194,18 +194,18 @@ The ext4 superblock is laid out as follows in
- Performance hints. Directory preallocation should only happen if the
EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.
* - 0xCC
- - \_\_u8
- - s\_prealloc\_blocks
+ - __u8
+ - s_prealloc_blocks
- #. of blocks to try to preallocate for ... files? (Not used in
e2fsprogs/Linux)
* - 0xCD
- - \_\_u8
- - s\_prealloc\_dir\_blocks
+ - __u8
+ - s_prealloc_dir_blocks
- #. of blocks to preallocate for directories. (Not used in
e2fsprogs/Linux)
* - 0xCE
- - \_\_le16
- - s\_reserved\_gdt\_blocks
+ - __le16
+ - s_reserved_gdt_blocks
- Number of reserved GDT entries for future filesystem expansion.
* -
-
@@ -213,277 +213,281 @@ The ext4 superblock is laid out as follows in
- Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is
set.
* - 0xD0
- - \_\_u8
- - s\_journal\_uuid[16]
+ - __u8
+ - s_journal_uuid[16]
- UUID of journal superblock
* - 0xE0
- - \_\_le32
- - s\_journal\_inum
+ - __le32
+ - s_journal_inum
- inode number of journal file.
* - 0xE4
- - \_\_le32
- - s\_journal\_dev
+ - __le32
+ - s_journal_dev
- Device number of journal file, if the external journal feature flag is
set.
* - 0xE8
- - \_\_le32
- - s\_last\_orphan
+ - __le32
+ - s_last_orphan
- Start of list of orphaned inodes to delete.
* - 0xEC
- - \_\_le32
- - s\_hash\_seed[4]
+ - __le32
+ - s_hash_seed[4]
- HTREE hash seed.
* - 0xFC
- - \_\_u8
- - s\_def\_hash\_version
+ - __u8
+ - s_def_hash_version
- Default hash algorithm to use for directory hashes. See super_def_hash_
for more info.
* - 0xFD
- - \_\_u8
- - s\_jnl\_backup\_type
- - If this value is 0 or EXT3\_JNL\_BACKUP\_BLOCKS (1), then the
+ - __u8
+ - s_jnl_backup_type
+ - If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the
``s_jnl_blocks`` field contains a duplicate copy of the inode's
``i_block[]`` array and ``i_size``.
* - 0xFE
- - \_\_le16
- - s\_desc\_size
+ - __le16
+ - s_desc_size
- Size of group descriptors, in bytes, if the 64bit incompat feature flag
is set.
* - 0x100
- - \_\_le32
- - s\_default\_mount\_opts
+ - __le32
+ - s_default_mount_opts
- Default mount options. See the super_mountopts_ table for more info.
* - 0x104
- - \_\_le32
- - s\_first\_meta\_bg
- - First metablock block group, if the meta\_bg feature is enabled.
+ - __le32
+ - s_first_meta_bg
+ - First metablock block group, if the meta_bg feature is enabled.
* - 0x108
- - \_\_le32
- - s\_mkfs\_time
+ - __le32
+ - s_mkfs_time
- When the filesystem was created, in seconds since the epoch.
* - 0x10C
- - \_\_le32
- - s\_jnl\_blocks[17]
+ - __le32
+ - s_jnl_blocks[17]
- Backup copy of the journal inode's ``i_block[]`` array in the first 15
- elements and i\_size\_high and i\_size in the 16th and 17th elements,
+ elements and i_size_high and i_size in the 16th and 17th elements,
respectively.
* -
-
-
- 64bit support is valid only if EXT4_FEATURE_COMPAT_64BIT is set.
* - 0x150
- - \_\_le32
- - s\_blocks\_count\_hi
+ - __le32
+ - s_blocks_count_hi
- High 32-bits of the block count.
* - 0x154
- - \_\_le32
- - s\_r\_blocks\_count\_hi
+ - __le32
+ - s_r_blocks_count_hi
- High 32-bits of the reserved block count.
* - 0x158
- - \_\_le32
- - s\_free\_blocks\_count\_hi
+ - __le32
+ - s_free_blocks_count_hi
- High 32-bits of the free block count.
* - 0x15C
- - \_\_le16
- - s\_min\_extra\_isize
+ - __le16
+ - s_min_extra_isize
- All inodes have at least # bytes.
* - 0x15E
- - \_\_le16
- - s\_want\_extra\_isize
+ - __le16
+ - s_want_extra_isize
- New inodes should reserve # bytes.
* - 0x160
- - \_\_le32
- - s\_flags
+ - __le32
+ - s_flags
- Miscellaneous flags. See the super_flags_ table for more info.
* - 0x164
- - \_\_le16
- - s\_raid\_stride
+ - __le16
+ - s_raid_stride
- RAID stride. This is the number of logical blocks read from or written
to the disk before moving to the next disk. This affects the placement
of filesystem metadata, which will hopefully make RAID storage faster.
* - 0x166
- - \_\_le16
- - s\_mmp\_interval
+ - __le16
+ - s_mmp_interval
- #. seconds to wait in multi-mount prevention (MMP) checking. In theory,
MMP is a mechanism to record in the superblock which host and device
have mounted the filesystem, in order to prevent multiple mounts. This
feature does not seem to be implemented...
* - 0x168
- - \_\_le64
- - s\_mmp\_block
+ - __le64
+ - s_mmp_block
- Block # for multi-mount protection data.
* - 0x170
- - \_\_le32
- - s\_raid\_stripe\_width
+ - __le32
+ - s_raid_stripe_width
- RAID stripe width. This is the number of logical blocks read from or
written to the disk before coming back to the current disk. This is used
by the block allocator to try to reduce the number of read-modify-write
operations in a RAID5/6.
* - 0x174
- - \_\_u8
- - s\_log\_groups\_per\_flex
+ - __u8
+ - s_log_groups_per_flex
- Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``.
* - 0x175
- - \_\_u8
- - s\_checksum\_type
+ - __u8
+ - s_checksum_type
- Metadata checksum algorithm type. The only valid value is 1 (crc32c).
* - 0x176
- - \_\_le16
- - s\_reserved\_pad
+ - __le16
+ - s_reserved_pad
-
* - 0x178
- - \_\_le64
- - s\_kbytes\_written
+ - __le64
+ - s_kbytes_written
- Number of KiB written to this filesystem over its lifetime.
* - 0x180
- - \_\_le32
- - s\_snapshot\_inum
+ - __le32
+ - s_snapshot_inum
- inode number of active snapshot. (Not used in e2fsprogs/Linux.)
* - 0x184
- - \_\_le32
- - s\_snapshot\_id
+ - __le32
+ - s_snapshot_id
- Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.)
* - 0x188
- - \_\_le64
- - s\_snapshot\_r\_blocks\_count
+ - __le64
+ - s_snapshot_r_blocks_count
- Number of blocks reserved for active snapshot's future use. (Not used in
e2fsprogs/Linux.)
* - 0x190
- - \_\_le32
- - s\_snapshot\_list
+ - __le32
+ - s_snapshot_list
- inode number of the head of the on-disk snapshot list. (Not used in
e2fsprogs/Linux.)
* - 0x194
- - \_\_le32
- - s\_error\_count
+ - __le32
+ - s_error_count
- Number of errors seen.
* - 0x198
- - \_\_le32
- - s\_first\_error\_time
+ - __le32
+ - s_first_error_time
- First time an error happened, in seconds since the epoch.
* - 0x19C
- - \_\_le32
- - s\_first\_error\_ino
+ - __le32
+ - s_first_error_ino
- inode involved in first error.
* - 0x1A0
- - \_\_le64
- - s\_first\_error\_block
+ - __le64
+ - s_first_error_block
- Number of block involved in first error.
* - 0x1A8
- - \_\_u8
- - s\_first\_error\_func[32]
+ - __u8
+ - s_first_error_func[32]
- Name of function where the error happened.
* - 0x1C8
- - \_\_le32
- - s\_first\_error\_line
+ - __le32
+ - s_first_error_line
- Line number where error happened.
* - 0x1CC
- - \_\_le32
- - s\_last\_error\_time
+ - __le32
+ - s_last_error_time
- Time of most recent error, in seconds since the epoch.
* - 0x1D0
- - \_\_le32
- - s\_last\_error\_ino
+ - __le32
+ - s_last_error_ino
- inode involved in most recent error.
* - 0x1D4
- - \_\_le32
- - s\_last\_error\_line
+ - __le32
+ - s_last_error_line
- Line number where most recent error happened.
* - 0x1D8
- - \_\_le64
- - s\_last\_error\_block
+ - __le64
+ - s_last_error_block
- Number of block involved in most recent error.
* - 0x1E0
- - \_\_u8
- - s\_last\_error\_func[32]
+ - __u8
+ - s_last_error_func[32]
- Name of function where the most recent error happened.
* - 0x200
- - \_\_u8
- - s\_mount\_opts[64]
+ - __u8
+ - s_mount_opts[64]
- ASCIIZ string of mount options.
* - 0x240
- - \_\_le32
- - s\_usr\_quota\_inum
+ - __le32
+ - s_usr_quota_inum
- Inode number of user `quota <quota>`__ file.
* - 0x244
- - \_\_le32
- - s\_grp\_quota\_inum
+ - __le32
+ - s_grp_quota_inum
- Inode number of group `quota <quota>`__ file.
* - 0x248
- - \_\_le32
- - s\_overhead\_blocks
+ - __le32
+ - s_overhead_blocks
- Overhead blocks/clusters in fs. (Huh? This field is always zero, which
means that the kernel calculates it dynamically.)
* - 0x24C
- - \_\_le32
- - s\_backup\_bgs[2]
- - Block groups containing superblock backups (if sparse\_super2)
+ - __le32
+ - s_backup_bgs[2]
+ - Block groups containing superblock backups (if sparse_super2)
* - 0x254
- - \_\_u8
- - s\_encrypt\_algos[4]
+ - __u8
+ - s_encrypt_algos[4]
- Encryption algorithms in use. There can be up to four algorithms in use
at any time; valid algorithm codes are given in the super_encrypt_ table
below.
* - 0x258
- - \_\_u8
- - s\_encrypt\_pw\_salt[16]
+ - __u8
+ - s_encrypt_pw_salt[16]
- Salt for the string2key algorithm for encryption.
* - 0x268
- - \_\_le32
- - s\_lpf\_ino
+ - __le32
+ - s_lpf_ino
- Inode number of lost+found
* - 0x26C
- - \_\_le32
- - s\_prj\_quota\_inum
+ - __le32
+ - s_prj_quota_inum
- Inode that tracks project quotas.
* - 0x270
- - \_\_le32
- - s\_checksum\_seed
- - Checksum seed used for metadata\_csum calculations. This value is
- crc32c(~0, $orig\_fs\_uuid).
+ - __le32
+ - s_checksum_seed
+ - Checksum seed used for metadata_csum calculations. This value is
+ crc32c(~0, $orig_fs_uuid).
* - 0x274
- - \_\_u8
- - s\_wtime_hi
+ - __u8
+ - s_wtime_hi
- Upper 8 bits of the s_wtime field.
* - 0x275
- - \_\_u8
- - s\_mtime_hi
+ - __u8
+ - s_mtime_hi
- Upper 8 bits of the s_mtime field.
* - 0x276
- - \_\_u8
- - s\_mkfs_time_hi
+ - __u8
+ - s_mkfs_time_hi
- Upper 8 bits of the s_mkfs_time field.
* - 0x277
- - \_\_u8
- - s\_lastcheck_hi
- - Upper 8 bits of the s_lastcheck_hi field.
+ - __u8
+ - s_lastcheck_hi
+ - Upper 8 bits of the s_lastcheck field.
* - 0x278
- - \_\_u8
- - s\_first_error_time_hi
- - Upper 8 bits of the s_first_error_time_hi field.
+ - __u8
+ - s_first_error_time_hi
+ - Upper 8 bits of the s_first_error_time field.
* - 0x279
- - \_\_u8
- - s\_last_error_time_hi
- - Upper 8 bits of the s_last_error_time_hi field.
+ - __u8
+ - s_last_error_time_hi
+ - Upper 8 bits of the s_last_error_time field.
* - 0x27A
- - \_\_u8
- - s\_pad[2]
+ - __u8
+ - s_pad[2]
- Zero padding.
* - 0x27C
- - \_\_le16
- - s\_encoding
+ - __le16
+ - s_encoding
- Filename charset encoding.
* - 0x27E
- - \_\_le16
- - s\_encoding_flags
+ - __le16
+ - s_encoding_flags
- Filename charset encoding flags.
* - 0x280
- - \_\_le32
- - s\_reserved[95]
+ - __le32
+ - s_orphan_file_inum
+ - Orphan file inode number.
+ * - 0x284
+ - __le32
+ - s_reserved[94]
- Padding to the end of the block.
* - 0x3FC
- - \_\_le32
- - s\_checksum
+ - __le32
+ - s_checksum
- Superblock checksum.
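
For readers who want to poke at these fields by hand, here is a minimal
user-space sketch (the device path is a placeholder, and pread() from a
raw device is just one way to do it) that reads the primary superblock,
which occupies 1024 bytes starting at offset 1024 on the device, and
decodes two fields by the offsets given in the table above::

    #include <endian.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        uint8_t sb[1024];
        uint32_t incompat, csum;
        int fd = open("/dev/sdx", O_RDONLY);    /* placeholder device */

        if (fd < 0 || pread(fd, sb, sizeof(sb), 1024) != sizeof(sb))
            return 1;
        memcpy(&incompat, sb + 0x60, 4);        /* s_feature_incompat */
        memcpy(&csum, sb + 0x3FC, 4);           /* s_checksum */
        printf("incompat 0x%08x checksum 0x%08x\n",
               le32toh(incompat), le32toh(csum));
        close(fd);
        return 0;
    }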
.. _super_state:
@@ -570,32 +574,44 @@ following:
* - Value
- Description
* - 0x1
- - Directory preallocation (COMPAT\_DIR\_PREALLOC).
+ - Directory preallocation (COMPAT_DIR_PREALLOC).
* - 0x2
- “imagic inodes”. Not clear from the code what this does
- (COMPAT\_IMAGIC\_INODES).
+ (COMPAT_IMAGIC_INODES).
* - 0x4
- - Has a journal (COMPAT\_HAS\_JOURNAL).
+ - Has a journal (COMPAT_HAS_JOURNAL).
* - 0x8
- - Supports extended attributes (COMPAT\_EXT\_ATTR).
+ - Supports extended attributes (COMPAT_EXT_ATTR).
* - 0x10
- Has reserved GDT blocks for filesystem expansion
- (COMPAT\_RESIZE\_INODE). Requires RO\_COMPAT\_SPARSE\_SUPER.
+ (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER.
* - 0x20
- - Has directory indices (COMPAT\_DIR\_INDEX).
+ - Has directory indices (COMPAT_DIR_INDEX).
* - 0x40
- “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized
- block groups? (COMPAT\_LAZY\_BG)
+ block groups? (COMPAT_LAZY_BG)
* - 0x80
- - “Exclude inode”. Not used. (COMPAT\_EXCLUDE\_INODE).
+ - “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE).
* - 0x100
- “Exclude bitmap”. Seems to be used to indicate the presence of
snapshot-related exclude bitmaps? Not defined in kernel or used in
- e2fsprogs (COMPAT\_EXCLUDE\_BITMAP).
+ e2fsprogs (COMPAT_EXCLUDE_BITMAP).
* - 0x200
- - Sparse Super Block, v2. If this flag is set, the SB field s\_backup\_bgs
+ - Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs
points to the two block groups that contain backup superblocks
- (COMPAT\_SPARSE\_SUPER2).
+ (COMPAT_SPARSE_SUPER2).
+ * - 0x400
+ - Fast commits supported. Although fast commit blocks are
+ backward incompatible, fast commit blocks are not always
+ present in the journal. If fast commit blocks are present in
+ the journal, the JBD2 incompat feature
+ (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets
+ set (COMPAT_FAST_COMMIT).
+ * - 0x1000
+ - Orphan file allocated. This is the special file for more efficient
+ tracking of unlinked but still open inodes. When there may be any
+ entries in the file, we additionally set the corresponding
+ rocompat feature (RO_COMPAT_ORPHAN_PRESENT).
.. _super_incompat:
@@ -609,45 +625,45 @@ following:
* - Value
- Description
* - 0x1
- - Compression (INCOMPAT\_COMPRESSION).
+ - Compression (INCOMPAT_COMPRESSION).
* - 0x2
- - Directory entries record the file type. See ext4\_dir\_entry\_2 below
- (INCOMPAT\_FILETYPE).
+ - Directory entries record the file type. See ext4_dir_entry_2 below
+ (INCOMPAT_FILETYPE).
* - 0x4
- - Filesystem needs recovery (INCOMPAT\_RECOVER).
+ - Filesystem needs recovery (INCOMPAT_RECOVER).
* - 0x8
- - Filesystem has a separate journal device (INCOMPAT\_JOURNAL\_DEV).
+ - Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV).
* - 0x10
- Meta block groups. See the earlier discussion of this feature
- (INCOMPAT\_META\_BG).
+ (INCOMPAT_META_BG).
* - 0x40
- - Files in this filesystem use extents (INCOMPAT\_EXTENTS).
+ - Files in this filesystem use extents (INCOMPAT_EXTENTS).
* - 0x80
- - Enable a filesystem size of 2^64 blocks (INCOMPAT\_64BIT).
+ - Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT).
* - 0x100
- - Multiple mount protection (INCOMPAT\_MMP).
+ - Multiple mount protection (INCOMPAT_MMP).
* - 0x200
- Flexible block groups. See the earlier discussion of this feature
- (INCOMPAT\_FLEX\_BG).
+ (INCOMPAT_FLEX_BG).
* - 0x400
- Inodes can be used to store large extended attribute values
- (INCOMPAT\_EA\_INODE).
+ (INCOMPAT_EA_INODE).
* - 0x1000
- - Data in directory entry (INCOMPAT\_DIRDATA). (Not implemented?)
+ - Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?)
* - 0x2000
- Metadata checksum seed is stored in the superblock. This feature enables
- the administrator to change the UUID of a metadata\_csum filesystem
+ the administrator to change the UUID of a metadata_csum filesystem
while the filesystem is mounted; without it, the checksum definition
- requires all metadata blocks to be rewritten (INCOMPAT\_CSUM\_SEED).
+ requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED).
* - 0x4000
- - Large directory >2GB or 3-level htree (INCOMPAT\_LARGEDIR). Prior to
+ - Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to
this feature, directories could not be larger than 4GiB and could not
have an htree more than 2 levels deep. If this feature is enabled,
directories can be larger than 4GiB and have a maximum htree depth of 3.
* - 0x8000
- - Data in inode (INCOMPAT\_INLINE\_DATA).
+ - Data in inode (INCOMPAT_INLINE_DATA).
* - 0x10000
- - Encrypted inodes are present on the filesystem. (INCOMPAT\_ENCRYPT).
+ - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT).
.. _super_rocompat:
@@ -662,50 +678,54 @@ the following:
- Description
* - 0x1
- Sparse superblocks. See the earlier discussion of this feature
- (RO\_COMPAT\_SPARSE\_SUPER).
+ (RO_COMPAT_SPARSE_SUPER).
* - 0x2
- This filesystem has been used to store a file greater than 2GiB
- (RO\_COMPAT\_LARGE\_FILE).
+ (RO_COMPAT_LARGE_FILE).
* - 0x4
- - Not used in kernel or e2fsprogs (RO\_COMPAT\_BTREE\_DIR).
+ - Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR).
* - 0x8
- This filesystem has files whose sizes are represented in units of
logical blocks, not 512-byte sectors. This implies a very large file
- indeed! (RO\_COMPAT\_HUGE\_FILE)
+ indeed! (RO_COMPAT_HUGE_FILE)
* - 0x10
- Group descriptors have checksums. In addition to detecting corruption,
this is useful for lazy formatting with uninitialized groups
- (RO\_COMPAT\_GDT\_CSUM).
+ (RO_COMPAT_GDT_CSUM).
* - 0x20
- Indicates that the old ext3 32,000 subdirectory limit no longer applies
- (RO\_COMPAT\_DIR\_NLINK). A directory's i\_links\_count will be set to 1
+ (RO_COMPAT_DIR_NLINK). A directory's i_links_count will be set to 1
if it is incremented past 64,999.
* - 0x40
- Indicates that large inodes exist on this filesystem
- (RO\_COMPAT\_EXTRA\_ISIZE).
+ (RO_COMPAT_EXTRA_ISIZE).
* - 0x80
- - This filesystem has a snapshot (RO\_COMPAT\_HAS\_SNAPSHOT).
+ - This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT).
* - 0x100
- - `Quota <Quota>`__ (RO\_COMPAT\_QUOTA).
+ - `Quota <Quota>`__ (RO_COMPAT_QUOTA).
* - 0x200
- This filesystem supports “bigalloc”, which means that file extents are
tracked in units of clusters (of blocks) instead of blocks
- (RO\_COMPAT\_BIGALLOC).
+ (RO_COMPAT_BIGALLOC).
* - 0x400
- This filesystem supports metadata checksumming.
- (RO\_COMPAT\_METADATA\_CSUM; implies RO\_COMPAT\_GDT\_CSUM, though
- GDT\_CSUM must not be set)
+ (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though
+ GDT_CSUM must not be set)
* - 0x800
- Filesystem supports replicas. This feature is neither in the kernel nor
- e2fsprogs. (RO\_COMPAT\_REPLICA)
+ e2fsprogs. (RO_COMPAT_REPLICA)
* - 0x1000
- Read-only filesystem image; the kernel will not mount this image
read-write and most tools will refuse to write to the image.
- (RO\_COMPAT\_READONLY)
+ (RO_COMPAT_READONLY)
* - 0x2000
- - Filesystem tracks project quotas. (RO\_COMPAT\_PROJECT)
+ - Filesystem tracks project quotas. (RO_COMPAT_PROJECT)
* - 0x8000
- - Verity inodes may be present on the filesystem. (RO\_COMPAT\_VERITY)
+ - Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY)
+ * - 0x10000
+ - Indicates that the orphan file may have valid orphan entries that need
+ to be cleaned up when mounting the filesystem
+ (RO_COMPAT_ORPHAN_PRESENT).
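
The three feature sets above imply a simple mount-time decision, sketched
below; the SUPPORTED_* masks are hypothetical placeholders for whatever a
given implementation actually handles, not values taken from any kernel::

    #include <stdint.h>

    #define SUPPORTED_INCOMPAT  0x000002C2u /* hypothetical mask */
    #define SUPPORTED_RO_COMPAT 0x00000473u /* hypothetical mask */

    enum mount_result { MOUNT_RW, MOUNT_RO, MOUNT_REFUSE };

    enum mount_result feature_check(uint32_t incompat, uint32_t ro_compat)
    {
        /* Unknown incompat bit: the kernel (and fsck) must stop. */
        if (incompat & ~SUPPORTED_INCOMPAT)
            return MOUNT_REFUSE;
        /* Unknown ro_compat bit: mounting read-only is still safe. */
        if (ro_compat & ~SUPPORTED_RO_COMPAT)
            return MOUNT_RO;
        /* Unknown compat bits can simply be ignored by the kernel. */
        return MOUNT_RW;
    }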
.. _super_def_hash:
@@ -741,36 +761,36 @@ The ``s_default_mount_opts`` field is any combination of the following:
* - Value
- Description
* - 0x0001
- - Print debugging info upon (re)mount. (EXT4\_DEFM\_DEBUG)
+ - Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG)
* - 0x0002
- New files take the gid of the containing directory (instead of the fsgid
- of the current process). (EXT4\_DEFM\_BSDGROUPS)
+ of the current process). (EXT4_DEFM_BSDGROUPS)
* - 0x0004
- - Support userspace-provided extended attributes. (EXT4\_DEFM\_XATTR\_USER)
+ - Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_USER)
* - 0x0008
- - Support POSIX access control lists (ACLs). (EXT4\_DEFM\_ACL)
+ - Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL)
* - 0x0010
- - Do not support 32-bit UIDs. (EXT4\_DEFM\_UID16)
+ - Do not support 32-bit UIDs. (EXT4_DEFM_UID16)
* - 0x0020
- - All data and metadata are commited to the journal.
- (EXT4\_DEFM\_JMODE\_DATA)
+ - All data and metadata are committed to the journal.
+ (EXT4_DEFM_JMODE_DATA)
* - 0x0040
- All data are flushed to the disk before metadata are committed to the
- journal. (EXT4\_DEFM\_JMODE\_ORDERED)
+ journal. (EXT4_DEFM_JMODE_ORDERED)
* - 0x0060
- Data ordering is not preserved; data may be written after the metadata
- has been written. (EXT4\_DEFM\_JMODE\_WBACK)
+ has been written. (EXT4_DEFM_JMODE_WBACK)
* - 0x0100
- - Disable write flushes. (EXT4\_DEFM\_NOBARRIER)
+ - Disable write flushes. (EXT4_DEFM_NOBARRIER)
* - 0x0200
- Track which blocks in a filesystem are metadata and therefore should not
be used as data blocks. This option will be enabled by default on 3.18,
- hopefully. (EXT4\_DEFM\_BLOCK\_VALIDITY)
+ hopefully. (EXT4_DEFM_BLOCK_VALIDITY)
* - 0x0400
- Enable DISCARD support, where the storage device is told about blocks
- becoming unused. (EXT4\_DEFM\_DISCARD)
+ becoming unused. (EXT4_DEFM_DISCARD)
* - 0x0800
- - Disable delayed allocation. (EXT4\_DEFM\_NODELALLOC)
+ - Disable delayed allocation. (EXT4_DEFM_NODELALLOC)
.. _super_flags:
@@ -800,12 +820,12 @@ The ``s_encrypt_algos`` list can contain any of the following:
* - Value
- Description
* - 0
- - Invalid algorithm (ENCRYPTION\_MODE\_INVALID).
+ - Invalid algorithm (ENCRYPTION_MODE_INVALID).
* - 1
- - 256-bit AES in XTS mode (ENCRYPTION\_MODE\_AES\_256\_XTS).
+ - 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS).
* - 2
- - 256-bit AES in GCM mode (ENCRYPTION\_MODE\_AES\_256\_GCM).
+ - 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM).
* - 3
- - 256-bit AES in CBC mode (ENCRYPTION\_MODE\_AES\_256\_CBC).
+ - 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC).
Total size of the superblock is 1024 bytes.
diff --git a/Documentation/filesystems/ext4/verity.rst b/Documentation/filesystems/ext4/verity.rst
index 3e4c0ee0e068..e99ff3fd09f7 100644
--- a/Documentation/filesystems/ext4/verity.rst
+++ b/Documentation/filesystems/ext4/verity.rst
@@ -39,3 +39,6 @@ is encrypted as well as the data itself.
Verity files cannot have blocks allocated past the end of the verity
metadata.
+
+Verity and DAX are not compatible and attempts to set both of these flags
+on a file will fail.
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index 87d794bc75a4..d32c6209685d 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -25,10 +25,14 @@ a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
-For reporting bugs and sending patches, please use the following mailing list:
+For sending patches, please use the following mailing list:
- linux-f2fs-devel@lists.sourceforge.net
+For reporting bugs, please use the following f2fs bug tracker link:
+
+- https://bugzilla.kernel.org/enter_bug.cgi?product=File%20System&component=f2fs
+
Background and Design issues
============================
@@ -101,160 +105,269 @@ Mount Options
=============
-====================== ============================================================
-background_gc=%s Turn on/off cleaning operations, namely garbage
- collection, triggered in background when I/O subsystem is
- idle. If background_gc=on, it will turn on the garbage
- collection and if background_gc=off, garbage collection
- will be turned off. If background_gc=sync, it will turn
- on synchronous garbage collection running in background.
- Default value for this option is on. So garbage
- collection is on by default.
-disable_roll_forward Disable the roll-forward recovery routine
-norecovery Disable the roll-forward recovery routine, mounted read-
- only (i.e., -o ro,disable_roll_forward)
-discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
- enabled, f2fs will issue discard/TRIM commands when a
- segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
-nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
- by default if CONFIG_F2FS_FS_XATTR is selected.
-noacl Disable POSIX Access Control List. Note: acl is enabled
- by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
-active_logs=%u Support configuring the number of active logs. In the
- current design, f2fs supports only 2, 4, and 6 logs.
- Default number is 6.
-disable_ext_identify Disable the extension list configured by mkfs, so f2fs
- does not aware of cold files such as media files.
-inline_xattr Enable the inline xattrs feature.
-noinline_xattr Disable the inline xattrs feature.
-inline_xattr_size=%u Support configuring inline xattr size, it depends on
- flexible inline xattr feature.
-inline_data Enable the inline data feature: New created small(<~3.4k)
- files can be written into inode block.
-inline_dentry Enable the inline dir feature: data in new created
- directory entries can be written into inode block. The
- space of inode block which is used to store inline
- dentries is limited to ~3.4k.
-noinline_dentry Disable the inline dentry feature.
-flush_merge Merge concurrent cache_flush commands as much as possible
- to eliminate redundant command issues. If the underlying
- device handles the cache_flush command relatively slowly,
- recommend to enable this option.
-nobarrier This option can be used if underlying storage guarantees
- its cached data should be written to the novolatile area.
- If this option is set, no cache_flush commands are issued
- but f2fs still guarantees the write ordering of all the
- data writes.
-fastboot This option is used when a system wants to reduce mount
- time as much as possible, even though normal performance
- can be sacrificed.
-extent_cache Enable an extent cache based on rb-tree, it can cache
- as many as extent which map between contiguous logical
- address and physical address per inode, resulting in
- increasing the cache hit ratio. Set by default.
-noextent_cache Disable an extent cache based on rb-tree explicitly, see
- the above extent_cache mount option.
-noinline_data Disable the inline data feature, inline data feature is
- enabled by default.
-data_flush Enable data flushing before checkpoint in order to
- persist data of regular and symlink.
-reserve_root=%d Support configuring reserved space which is used for
- allocation from a privileged user with specified uid or
- gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d The user ID which may use the reserved blocks.
-resgid=%d The group ID which may use the reserved blocks.
-fault_injection=%d Enable fault injection in all supported types with
- specified injection rate.
-fault_type=%d Support configuring fault injection type, should be
- enabled with fault_injection option, fault type value
- is shown below, it supports single or combined type.
-
- =================== ===========
- Type_Name Type_Value
- =================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- =================== ===========
-mode=%s Control block allocation mode which supports "adaptive"
- and "lfs". In "lfs" mode, there should be no random
- writes towards main area.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
-usrquota Enable plain user disk quota accounting.
-grpquota Enable plain group disk quota accounting.
-prjquota Enable plain project quota accounting.
-usrjquota=<file> Appoint specified file and type during mount, so that quota
-grpjquota=<file> information can be properly updated during recovery flow,
-prjjquota=<file> <quota file>: must be in root directory;
-jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
-offusrjquota Turn off user journelled quota.
-offgrpjquota Turn off group journelled quota.
-offprjjquota Turn off project journelled quota.
-quota Enable plain user disk quota accounting.
-noquota Disable all plain disk quota option.
-whint_mode=%s Control which write hints are passed down to block
- layer. This supports "off", "user-based", and
- "fs-based". In "off" mode (default), f2fs does not pass
- down hints. In "user-based" mode, f2fs tries to pass
- down hints given by users. And in "fs-based" mode, f2fs
- passes down hints with its policy.
-alloc_mode=%s Adjust block allocation policy, which supports "reuse"
- and "default".
-fsync_mode=%s Control the policy of fsync. Currently supports "posix",
- "strict", and "nobarrier". In "posix" mode, which is
- default, fsync will follow POSIX semantics and does a
- light operation to improve the filesystem performance.
- In "strict" mode, fsync will be heavy and behaves in line
- with xfs, ext4 and btrfs, where xfstest generic/342 will
- pass, but the performance will regress. "nobarrier" is
- based on "posix", but doesn't issue flush command for
- non-atomic files likewise "nobarrier" mount option.
-test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt
- context. The fake fscrypt context is used by xfstests.
-checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
- to reenable checkpointing. Is enabled by default. While
- disabled, any unmounting or unexpected shutdowns will cause
- the filesystem contents to appear as they did when the
- filesystem was mounted with that option.
- While mounting with checkpoint=disabled, the filesystem must
- run garbage collection to ensure that all available space can
- be used. If this takes too much time, the mount may return
- EAGAIN. You may optionally add a value to indicate how much
- of the disk you would be willing to temporarily give up to
- avoid additional garbage collection. This can be given as a
- number of blocks, or as a percent. For instance, mounting
- with checkpoint=disable:100% would always succeed, but it may
- hide up to all remaining free space. The actual space that
- would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
- This space is reclaimed once checkpoint=enable.
-compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
- "lz4" and "zstd" algorithm.
-compress_log_size=%u Support configuring compress cluster size, the size will
- be 4KB * (1 << %u), 16KB is minimum size, also it's
- default size.
-compress_extension=%s Support adding specified extension, so that f2fs can enable
- compression on those corresponding files, e.g. if all files
- with '.ext' has high compression rate, we can set the '.ext'
- on compression extension list and enable compression on
- these file by default rather than to enable it via ioctl.
- For other files, we can still enable compression via ioctl.
-====================== ============================================================
+======================== ============================================================
+background_gc=%s Turn on/off cleaning operations, namely garbage
+ collection, triggered in background when I/O subsystem is
+ idle. If background_gc=on, it will turn on the garbage
+ collection and if background_gc=off, garbage collection
+ will be turned off. If background_gc=sync, it will turn
+ on synchronous garbage collection running in background.
+ The default value for this option is on, so garbage
+ collection is on by default.
+gc_merge When background_gc is on, this option can be enabled to
+ let the background GC thread handle foreground GC requests;
+ this eliminates the sluggishness caused by slow foreground
+ GC operation when GC is triggered from a process with limited
+ I/O and CPU resources.
+nogc_merge Disable GC merge feature.
+disable_roll_forward Disable the roll-forward recovery routine
+norecovery Disable the roll-forward recovery routine, mounted read-
+ only (i.e., -o ro,disable_roll_forward)
+discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
+ enabled, f2fs will issue discard/TRIM commands when a
+ segment is cleaned.
+no_heap Disable heap-style segment allocation, which finds free
+ segments for data from the beginning of the main area, and
+ for nodes from the end of the main area.
+nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
+ by default if CONFIG_F2FS_FS_XATTR is selected.
+noacl Disable POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
+active_logs=%u Support configuring the number of active logs. In the
+ current design, f2fs supports only 2, 4, and 6 logs.
+ Default number is 6.
+disable_ext_identify Disable the extension list configured by mkfs, so f2fs
+ is not aware of cold files such as media files.
+inline_xattr Enable the inline xattrs feature.
+noinline_xattr Disable the inline xattrs feature.
+inline_xattr_size=%u Support configuring inline xattr size; it depends on
+ the flexible inline xattr feature.
+inline_data Enable the inline data feature: Newly created small (<~3.4k)
+ files can be written into inode block.
+inline_dentry Enable the inline dir feature: data in newly created
+ directory entries can be written into inode block. The
+ space of inode block which is used to store inline
+ dentries is limited to ~3.4k.
+noinline_dentry Disable the inline dentry feature.
+flush_merge Merge concurrent cache_flush commands as much as possible
+ to eliminate redundant command issues. If the underlying
+ device handles the cache_flush command relatively slowly,
+ enabling this option is recommended.
+nobarrier This option can be used if underlying storage guarantees
+ that its cached data will be written to the non-volatile area.
+ If this option is set, no cache_flush commands are issued
+ but f2fs still guarantees the write ordering of all the
+ data writes.
+barrier If this option is set, cache_flush commands are allowed to be
+ issued.
+fastboot This option is used when a system wants to reduce mount
+ time as much as possible, even though normal performance
+ can be sacrificed.
+extent_cache Enable an extent cache based on rb-tree; it can cache
+ as many extents as map between contiguous logical
+ and physical addresses per inode, resulting in an
+ increased cache hit ratio. Set by default.
+noextent_cache Disable an extent cache based on rb-tree explicitly, see
+ the above extent_cache mount option.
+noinline_data Disable the inline data feature, inline data feature is
+ enabled by default.
+data_flush Enable data flushing before checkpoint in order to
+ persist data of regular files and symlinks.
+reserve_root=%d Support configuring reserved space which is used for
+ allocation from a privileged user with specified uid or
+ gid, unit: 4KB, the default limit is 0.2% of user blocks.
+resuid=%d The user ID which may use the reserved blocks.
+resgid=%d The group ID which may use the reserved blocks.
+fault_injection=%d Enable fault injection in all supported types with
+ specified injection rate.
+fault_type=%d Support configuring fault injection type; this should be
+ enabled with the fault_injection option. Fault type values
+ are shown below; single or combined types are supported.
+
+ =================== ===========
+ Type_Name Type_Value
+ =================== ===========
+ FAULT_KMALLOC 0x000000001
+ FAULT_KVMALLOC 0x000000002
+ FAULT_PAGE_ALLOC 0x000000004
+ FAULT_PAGE_GET 0x000000008
+ FAULT_ALLOC_BIO 0x000000010 (obsolete)
+ FAULT_ALLOC_NID 0x000000020
+ FAULT_ORPHAN 0x000000040
+ FAULT_BLOCK 0x000000080
+ FAULT_DIR_DEPTH 0x000000100
+ FAULT_EVICT_INODE 0x000000200
+ FAULT_TRUNCATE 0x000000400
+ FAULT_READ_IO 0x000000800
+ FAULT_CHECKPOINT 0x000001000
+ FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
+ FAULT_SLAB_ALLOC 0x000008000
+ FAULT_DQUOT_INIT 0x000010000
+ FAULT_LOCK_OP 0x000020000
+ FAULT_BLKADDR 0x000040000
+ =================== ===========
+mode=%s Control block allocation mode which supports "adaptive"
+ and "lfs". In "lfs" mode, there should be no random
+ writes towards main area.
+ "fragment:segment" and "fragment:block" are newly added here.
+ These are developer options for experiments to simulate filesystem
+ fragmentation/after-GC situation itself. The developers use these
+ modes to understand filesystem fragmentation/after-GC condition well,
+ and eventually get some insights to handle them better.
+ In "fragment:segment", f2fs allocates a new segment in ramdom
+ position. With this, we can simulate the after-GC condition.
+ In "fragment:block", we can scatter block allocation with
+ "max_fragment_chunk" and "max_fragment_hole" sysfs nodes.
+ We added some randomness to both chunk and hole size to make
+ it close to realistic IO pattern. So, in this mode, f2fs will allocate
+ 1..<max_fragment_chunk> blocks in a chunk and make a hole in the
+ length of 1..<max_fragment_hole> by turns. With this, the newly
+ allocated blocks will be scattered throughout the whole partition.
+ Note that "fragment:block" implicitly enables "fragment:segment"
+ option for more randomness.
+ Please, use these options for your experiments and we strongly
+ recommend to re-format the filesystem after using these options.
+io_bits=%u Set the bit size of write IO requests. It should be set
+ with "mode=lfs".
+usrquota Enable plain user disk quota accounting.
+grpquota Enable plain group disk quota accounting.
+prjquota Enable plain project quota accounting.
+usrjquota=<file> Appoint specified file and type during mount, so that quota
+grpjquota=<file> information can be properly updated during recovery flow,
+prjjquota=<file> <quota file>: must be in root directory;
+jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
+offusrjquota Turn off user journalled quota.
+offgrpjquota Turn off group journalled quota.
+offprjjquota Turn off project journalled quota.
+quota Enable plain user disk quota accounting.
+noquota Disable all plain disk quota options.
+alloc_mode=%s Adjust block allocation policy, which supports "reuse"
+ and "default".
+fsync_mode=%s Control the policy of fsync. Currently supports "posix",
+ "strict", and "nobarrier". In "posix" mode, which is
+ default, fsync will follow POSIX semantics and does a
+ light operation to improve the filesystem performance.
+ In "strict" mode, fsync will be heavy and behaves in line
+ with xfs, ext4 and btrfs, where xfstest generic/342 will
+ pass, but the performance will regress. "nobarrier" is
+ based on "posix", but doesn't issue flush command for
+ non-atomic files, like the "nobarrier" mount option.
+test_dummy_encryption
+test_dummy_encryption=%s
+ Enable dummy encryption, which provides a fake fscrypt
+ context. The fake fscrypt context is used by xfstests.
+ The argument may be either "v1" or "v2", in order to
+ select the corresponding fscrypt policy version.
+checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
+ to reenable checkpointing. Is enabled by default. While
+ disabled, any unmounting or unexpected shutdowns will cause
+ the filesystem contents to appear as they did when the
+ filesystem was mounted with that option.
+ While mounting with checkpoint=disable, the filesystem must
+ run garbage collection to ensure that all available space can
+ be used. If this takes too much time, the mount may return
+ EAGAIN. You may optionally add a value to indicate how much
+ of the disk you would be willing to temporarily give up to
+ avoid additional garbage collection. This can be given as a
+ number of blocks, or as a percent. For instance, mounting
+ with checkpoint=disable:100% would always succeed, but it may
+ hide up to all remaining free space. The actual space that
+ would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
+ This space is reclaimed once checkpoint=enable.
+checkpoint_merge When checkpoint is enabled, this can be used to create a kernel
+ daemon and make it merge concurrent checkpoint requests as
+ much as possible to eliminate redundant checkpoint issues. Plus,
+ it eliminates the sluggishness caused by a slow checkpoint
+ operation when the checkpoint is done in a process context in
+ a cgroup having a low i/o budget and cpu shares. To make this
+ work better, we set the default i/o priority of the kernel daemon
+ to "3", one step higher than other kernel threads.
+ This is the same way an I/O priority is given to the jbd2
+ journaling thread of the ext4 filesystem.
+nocheckpoint_merge Disable checkpoint merge feature.
+compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
+ "lz4", "zstd" and "lzo-rle" algorithm.
+compress_algorithm=%s:%d Control compress algorithm and its compress level; currently,
+ only "lz4" and "zstd" support compress level config.
+ algorithm level range
+ lz4 3 - 16
+ zstd 1 - 22
+compress_log_size=%u Support configuring compress cluster size. The size will
+ be 4KB * (1 << %u). The default and minimum sizes are 16KB.
+compress_extension=%s Support adding specified extension, so that f2fs can enable
+ compression on those corresponding files, e.g. if all files
+ with '.ext' have a high compression rate, we can set the '.ext'
+ on the compression extension list and enable compression on
+ these files by default rather than enabling it via ioctl.
+ For other files, we can still enable compression via ioctl.
+ Note that there is one reserved special extension, '*', which
+ can be set to enable compression for all files.
+nocompress_extension=%s Support adding specified extension, so that f2fs can disable
+ compression on those corresponding files, contrary to the compress extension.
+ If you know exactly which files cannot be compressed, you can use this.
+ The same extension name can't appear in both compress and nocompress
+ extension at the same time.
+ If the compress extension specifies all files, the types specified by the
+ nocompress extension will be treated as special cases and will not be compressed.
+ Using '*' to specify all files in the nocompress extension is not
+ allowed.
+ After adding nocompress_extension, the priority is:
+ dir_flag < comp_extension,nocompress_extension < comp_file_flag,no_comp_file_flag.
+ See more in the compression sections.
+
+compress_chksum Support verifying chksum of raw data in compressed cluster.
+compress_mode=%s Control file compression mode. This supports "fs" and "user"
+ modes. In "fs" mode (default), f2fs does automatic compression
+ on the compression enabled files. In "user" mode, f2fs disables
+ the automatic compression and gives the user discretion of
+ choosing the target file and the timing. The user can do manual
+ compression/decompression on the compression enabled files using
+ ioctls.
+compress_cache Support using the address space of a filesystem-managed
+ inode to cache compressed blocks, in order to improve the
+ cache hit ratio of random reads.
+inlinecrypt When possible, encrypt/decrypt the contents of encrypted
+ files using the blk-crypto framework rather than
+ filesystem-layer encryption. This allows the use of
+ inline encryption hardware. The on-disk format is
+ unaffected. For more details, see
+ Documentation/block/inline-encryption.rst.
+atgc Enable age-threshold garbage collection, it provides high
+ effectiveness and efficiency on background GC.
+discard_unit=%s Control the discard unit; the argument can be "block",
+ "segment" or "section". Issued discard commands' offset/size
+ will be aligned to the unit. By default, "discard_unit=block"
+ is set, so that small discard functionality is enabled.
+ For a blkzoned device, "discard_unit=section" will be set by
+ default; this helps large-sized SMR or ZNS devices reduce
+ memory cost by getting rid of the fs metadata that supports
+ small discard.
+memory=%s Control memory mode. This supports "normal" and "low" modes.
+ "low" mode is introduced to support low memory devices.
+ Because of the nature of low memory devices, in this mode, f2fs
+ will try to save memory sometimes by sacrificing performance.
+ "normal" mode is the default mode and same as before.
+age_extent_cache Enable an age extent cache based on rb-tree. It records
+ data block update frequency of the extent per inode, in
+ order to provide better temperature hints for data block
+ allocation.
+errors=%s Specify f2fs behavior on critical errors. This supports the
+ modes "panic", "continue" and "remount-ro", which respectively
+ trigger a panic immediately, continue without doing anything,
+ and remount the partition in read-only mode. By default it
+ uses "continue" mode.
+ ====================== =============== =============== ========
+ mode continue remount-ro panic
+ ====================== =============== =============== ========
+ access ops normal normal N/A
+ syscall errors -EIO -EROFS N/A
+ mount option rw ro N/A
+ pending dir write keep keep N/A
+ pending non-dir write drop keep N/A
+ pending node write drop keep N/A
+ pending meta write keep keep N/A
+ ====================== =============== =============== ========
+======================== ============================================================
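
As a quick usage illustration (the device and mount point are placeholders;
the particular option mix is arbitrary), several of the options above can be
combined in the data string passed to mount(2)::

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        const char *opts = "background_gc=on,gc_merge,discard,"
                           "compress_algorithm=lz4:6,checkpoint_merge";

        if (mount("/dev/sdx", "/mnt/f2fs", "f2fs", 0, opts)) {
            perror("mount");
            return 1;
        }
        return 0;
    }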
Debugfs Entries
===============
@@ -289,7 +402,7 @@ Usage
# insmod f2fs.ko
-3. Create a directory trying to mount::
+3. Create a directory to use when mounting::
# mkdir /mnt/f2fs
@@ -303,7 +416,7 @@ mkfs.f2fs
The mkfs.f2fs tool formats a partition as the f2fs filesystem,
which builds a basic on-disk layout.
-The options consist of:
+The quick options consist of:
=============== ===========================================================
``-l [label]`` Give a volume label, up to 512 unicode name.
@@ -325,6 +438,8 @@ The options consist of:
1 is set by default, which conducts discard.
=============== ===========================================================
+Note: please refer to the manpage of mkfs.f2fs(8) to get full option list.
+
fsck.f2fs
---------
The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
@@ -332,10 +447,12 @@ partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that the initial version of the tool does not fix any inconsistency.
-The options consist of::
+The quick options consist of::
-d debug level [default:0]
+Note: please refer to the manpage of fsck.f2fs(8) to get full option list.
+
dump.f2fs
---------
The dump.f2fs shows the information of a specific inode and dumps SSA and SIT to
@@ -359,6 +476,37 @@ Examples::
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
+Note: please refer to the manpage of dump.f2fs(8) to get full option list.
+
+sload.f2fs
+----------
+The sload.f2fs gives a way to insert files and directories into an existing
+disk image. This tool is useful when building f2fs images given compiled files.
+
+Note: please refer to the manpage of sload.f2fs(8) to get full option list.
+
+resize.f2fs
+-----------
+The resize.f2fs lets a user resize the f2fs-formatted disk image, while preserving
+all the files and directories stored in the image.
+
+Note: please refer to the manpage of resize.f2fs(8) to get full option list.
+
+defrag.f2fs
+-----------
+The defrag.f2fs can be used to defragment scattered written data as well as
+filesystem metadata across the disk. This can improve the write speed by giving
+more free consecutive space.
+
+Note: please refer to the manpage of defrag.f2fs(8) to get full option list.
+
+f2fs_io
+-------
+The f2fs_io is a simple tool to issue various filesystem APIs as well as
+f2fs-specific ones, which is very useful for QA tests.
+
+Note: please refer to the manpage of f2fs_io(8) to get full option list.
+
Design
======
@@ -371,7 +519,7 @@ consists of a set of sections. By default, section and zone sizes are set to one
segment size identically, but users can easily modify the sizes by mkfs.
F2FS splits the entire volume into six areas, and all the areas except superblock
-consists of multiple segments as described below::
+consist of multiple segments as described below::
align with the zone size <-|
|-> align with the segment size
@@ -474,7 +622,7 @@ one inode block (i.e., a file) covers::
`- direct node (1018)
`- data (1018)
-Note that, all the node blocks are mapped by NAT which means the location of
+Note that all the node blocks are mapped by NAT which means the location of
each node is translated by the NAT table. In the consideration of the wandering
tree problem, F2FS is able to cut off the propagation of node updates caused by
leaf data writes.
@@ -554,7 +702,7 @@ When F2FS finds a file name in a directory, at first a hash value of the file
name is calculated. Then, F2FS scans the hash table in level #0 to find the
dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in
-each levels incrementally from 1 to N. In each levels F2FS needs to scan only
+each level incrementally from 1 to N. In each level F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files))
complexity::
@@ -628,74 +776,10 @@ In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering whole blocks in main area.
-Write-hint Policy
------------------
-
-1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
-
-2) whint_mode=user-based. F2FS tries to pass down hints given by
-users.
-
-===================== ======================== ===================
-User F2FS Block
-===================== ======================== ===================
- META WRITE_LIFE_NOT_SET
- HOT_NODE "
- WARM_NODE "
- COLD_NODE "
-ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
-extension list " "
-
--- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
-
--- direct io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " WRITE_LIFE_NONE
-WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
-WRITE_LIFE_LONG " WRITE_LIFE_LONG
-===================== ======================== ===================
-
-3) whint_mode=fs-based. F2FS passes down hints with its policy.
-
-===================== ======================== ===================
-User F2FS Block
-===================== ======================== ===================
- META WRITE_LIFE_MEDIUM;
- HOT_NODE WRITE_LIFE_NOT_SET
- WARM_NODE "
- COLD_NODE WRITE_LIFE_NONE
-ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
-extension list " "
-
--- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
-
--- direct io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " WRITE_LIFE_NONE
-WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
-WRITE_LIFE_LONG " WRITE_LIFE_LONG
-===================== ======================== ===================
-
Fallocate(2) Policy
-------------------
-The default policy follows the below posix rule.
+The default policy follows the below POSIX rule.
Allocating disk space
The default operation (i.e., mode is zero) of fallocate() allocates
@@ -708,7 +792,7 @@ Allocating disk space
as a method of optimally implementing that function.
However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
-fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having
+fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful to the below scenario where:
1. create(fd)
@@ -727,20 +811,49 @@ Compression implementation
cluster can be compressed or not.
- In cluster metadata layout, one special block address is used to indicate
- cluster is compressed one or normal one, for compressed cluster, following
+ a cluster is a compressed one or normal one; for compressed cluster, following
metadata maps the cluster to [1, 4 << n - 1] physical blocks, where f2fs
stores data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
supports compression on write-once files; data can be compressed only when
- all logical blocks in file are valid and cluster compress ratio is lower
- than specified threshold.
+ all logical blocks in cluster contain valid data and compress ratio of
+ cluster data is lower than specified threshold.
+- To enable compression on a regular inode, there are four ways:
+- To enable compression on regular inode, there are four ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
+ * mount w/ -o compress_extension=*; touch any_file
+
+- To disable compression on a regular inode, there are two ways:
+
+ * chattr -c file
+ * mount w/ -o nocompress_extension=ext; touch file.ext
+
+- Priority among FS_COMPR_FL, FS_NOCOMP_FL, and extensions:
+
+ * compress_extension=so; nocompress_extension=zip; chattr +c dir; touch
+ dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so and baz.txt
+ should be compressed, and bar.zip should be non-compressed. chattr +c
+ dir/bar.zip can enable compression on bar.zip.
+ * compress_extension=so; nocompress_extension=zip; chattr -c dir; touch
+ dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so should be
+ compressed, and bar.zip and baz.txt should be non-compressed.
+ chattr +c dir/bar.zip; chattr +c dir/baz.txt; can enable compression on
+ bar.zip and baz.txt.
+
+- At this point, the compression feature doesn't expose compressed space to
+ the user directly, in order to guarantee room for potential data updates to
+ that space later. Instead, the main goal is to reduce data writes to the
+ flash disk as much as possible, resulting in extended disk lifetime as well
+ as relaxed IO congestion. Alternatively, we've added the
+ ioctl(F2FS_IOC_RELEASE_COMPRESS_BLOCKS) interface to reclaim compressed
+ space and show it to the user after setting a special flag on the inode.
+ Once the compressed space is released, the flag will block writing data to
+ the file until either the compressed space is reserved via
+ ioctl(F2FS_IOC_RESERVE_COMPRESS_BLOCKS) or the file size is truncated to zero.
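
A hedged sketch of the reclaim flow described above follows; it assumes the
release ioctl reports the number of released blocks through a __u64
out-argument, so check the linux/f2fs.h header on your system for the exact
definitions before relying on it::

    #include <fcntl.h>
    #include <linux/f2fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        __u64 released = 0;
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;
        /* Sets the special flag and reclaims the compressed space. */
        if (ioctl(fd, F2FS_IOC_RELEASE_COMPRESS_BLOCKS, &released) == 0)
            printf("released %llu blocks\n", (unsigned long long)released);
        /* Writes fail until F2FS_IOC_RESERVE_COMPRESS_BLOCKS or truncate to 0. */
        close(fd);
        return 0;
    }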
Compress metadata layout::
@@ -749,14 +862,57 @@ Compress metadata layout::
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
- . . . .
+ . . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
- . .
+ . .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
+
+Compression mode
+----------------
+
+f2fs supports "fs" and "user" compression modes with "compression_mode" mount option.
+With this option, f2fs provides a choice to select the way how to compress the
+compression enabled files (refer to "Compression implementation" section for how to
+enable compression on a regular inode).
+
+1) compress_mode=fs
+This is the default option. f2fs does automatic compression in the writeback of the
+compression enabled files.
+
+2) compress_mode=user
+This disables the automatic compression and gives the user discretion of choosing the
+target file and the timing. The user can do manual compression/decompression on the
+compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
+ioctls, as shown below.
+
+To decompress a file::
+
+    fd = open(filename, O_WRONLY, 0);
+    ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
+
+To compress a file::
+
+    fd = open(filename, O_WRONLY, 0);
+    ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
+
+NVMe Zoned Namespace devices
+----------------------------
+
+- ZNS defines a per-zone capacity which can be equal to or less than the
+ zone-size. Zone-capacity is the number of usable blocks in the zone.
+ F2FS checks if zone-capacity is less than zone-size; if it is, then any
+ segment which starts after the zone-capacity is marked as not-free in
+ the free segment bitmap at initial mount time. These segments are marked
+ as permanently used so they are not allocated for writes and
+ consequently do not need to be garbage collected. In case the
+ zone-capacity is not aligned to the default segment size (2MB), a segment
+ can start before the zone-capacity and span across the zone-capacity
+ boundary. Such spanning segments are also considered usable segments. All
+ blocks past the zone-capacity are considered unusable in these segments.
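
To make the segment accounting concrete, the loop below walks one zone with
assumed geometry (zone-size 256MB, zone-capacity 249MB, 2MB segments; the
numbers are illustrative only) and reports which segments the rule above
would mark::

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long MB = 1024 * 1024;
        unsigned long long zone_size = 256 * MB;  /* assumed */
        unsigned long long zone_cap = 249 * MB;   /* assumed, unaligned */
        unsigned long long seg = 2 * MB;

        for (unsigned long long off = 0; off < zone_size; off += seg) {
            if (off >= zone_cap)
                printf("segment @%lluMB: not-free\n", off / MB);
            else if (off + seg > zone_cap)
                printf("segment @%lluMB: usable, tail unusable\n", off / MB);
        }
        return 0;
    }

With these numbers the segment at 248MB spans the capacity boundary and stays
usable (the blocks in its last 1MB are unusable), while the segments at 250MB
and beyond are marked not-free at mount time.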
diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.rst
index ac87e6fda842..93fc96f760aa 100644
--- a/Documentation/filesystems/fiemap.txt
+++ b/Documentation/filesystems/fiemap.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
============
Fiemap Ioctl
============
@@ -10,9 +12,9 @@ returns a list of extents.
Request Basics
--------------
-A fiemap request is encoded within struct fiemap:
+A fiemap request is encoded within struct fiemap::
-struct fiemap {
+ struct fiemap {
__u64 fm_start; /* logical offset (inclusive) at
* which to start mapping (in) */
__u64 fm_length; /* logical length of mapping which
@@ -23,7 +25,7 @@ struct fiemap {
__u32 fm_extent_count; /* size of fm_extents array (in) */
__u32 fm_reserved;
struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
-};
+ };
fm_start and fm_length specify the logical range within the file
@@ -51,12 +53,12 @@ nothing to prevent the file from changing between calls to FIEMAP.
The following flags can be set in fm_flags:
-* FIEMAP_FLAG_SYNC
-If this flag is set, the kernel will sync the file before mapping extents.
+FIEMAP_FLAG_SYNC
+ If this flag is set, the kernel will sync the file before mapping extents.
-* FIEMAP_FLAG_XATTR
-If this flag is set, the extents returned will describe the inodes
-extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_XATTR
+ If this flag is set, the extents returned will describe the inode's
+ extended attribute lookup tree, instead of its data tree.
Extent Mapping
@@ -75,18 +77,18 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents.
-
-struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
-};
+returned in fm_extents::
+
+ struct fiemap_extent {
+ __u64 fe_logical; /* logical offset in bytes for the start of
+ * the extent */
+ __u64 fe_physical; /* physical offset in bytes for the start
+ * of the extent */
+ __u64 fe_length; /* length in bytes for the extent */
+ __u64 fe_reserved64[2];
+ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
+ __u32 fe_reserved[3];
+ };
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extent's logical offset to start before the request or its logical
@@ -114,26 +116,27 @@ worry about all present and future flags which might imply unaligned
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
-* FIEMAP_EXTENT_LAST
-This is generally the last extent in the file. A mapping attempt past
-this extent may return nothing. Some implementations set this flag to
-indicate this extent is the last one in the range queried by the user
-(via fiemap->fm_length).
+FIEMAP_EXTENT_LAST
+ This is generally the last extent in the file. A mapping attempt past
+ this extent may return nothing. Some implementations set this flag to
+ indicate this extent is the last one in the range queried by the user
+ (via fiemap->fm_length).
+
+FIEMAP_EXTENT_UNKNOWN
+ The location of this extent is currently unknown. This may indicate
+ the data is stored on an inaccessible volume or that no storage has
+ been allocated for the file yet.
-* FIEMAP_EXTENT_UNKNOWN
-The location of this extent is currently unknown. This may indicate
-the data is stored on an inaccessible volume or that no storage has
-been allocated for the file yet.
+FIEMAP_EXTENT_DELALLOC
+ This will also set FIEMAP_EXTENT_UNKNOWN.
-* FIEMAP_EXTENT_DELALLOC
- - This will also set FIEMAP_EXTENT_UNKNOWN.
-Delayed allocation - while there is data for this extent, its
-physical location has not been allocated yet.
+ Delayed allocation - while there is data for this extent, its
+ physical location has not been allocated yet.
-* FIEMAP_EXTENT_ENCODED
-This extent does not consist of plain filesystem blocks but is
-encoded (e.g. encrypted or compressed). Reading the data in this
-extent via I/O to the block device will have undefined results.
+FIEMAP_EXTENT_ENCODED
+ This extent does not consist of plain filesystem blocks but is
+ encoded (e.g. encrypted or compressed). Reading the data in this
+ extent via I/O to the block device will have undefined results.
Note that it is *always* undefined to try to update the data
in-place by writing to the indicated location without the
@@ -145,32 +148,32 @@ unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is
clear; user applications must not try reading or writing to the
filesystem via the block device under any other circumstances.
-* FIEMAP_EXTENT_DATA_ENCRYPTED
- - This will also set FIEMAP_EXTENT_ENCODED
-The data in this extent has been encrypted by the file system.
+FIEMAP_EXTENT_DATA_ENCRYPTED
+ This will also set FIEMAP_EXTENT_ENCODED
+ The data in this extent has been encrypted by the file system.
-* FIEMAP_EXTENT_NOT_ALIGNED
-Extent offsets and length are not guaranteed to be block aligned.
+FIEMAP_EXTENT_NOT_ALIGNED
+ Extent offsets and length are not guaranteed to be block aligned.
-* FIEMAP_EXTENT_DATA_INLINE
+FIEMAP_EXTENT_DATA_INLINE
This will also set FIEMAP_EXTENT_NOT_ALIGNED
-Data is located within a meta data block.
+	Data is located within a metadata block.
-* FIEMAP_EXTENT_DATA_TAIL
+FIEMAP_EXTENT_DATA_TAIL
This will also set FIEMAP_EXTENT_NOT_ALIGNED
-Data is packed into a block with data from other files.
+ Data is packed into a block with data from other files.
-* FIEMAP_EXTENT_UNWRITTEN
-Unwritten extent - the extent is allocated but its data has not been
-initialized. This indicates the extent's data will be all zero if read
-through the filesystem but the contents are undefined if read directly from
-the device.
+FIEMAP_EXTENT_UNWRITTEN
+ Unwritten extent - the extent is allocated but its data has not been
+ initialized. This indicates the extent's data will be all zero if read
+ through the filesystem but the contents are undefined if read directly from
+ the device.
-* FIEMAP_EXTENT_MERGED
-This will be set when a file does not support extents, i.e., it uses a block
-based addressing scheme. Since returning an extent for each block back to
-userspace would be highly inefficient, the kernel will try to merge most
-adjacent blocks into 'extents'.
+FIEMAP_EXTENT_MERGED
+ This will be set when a file does not support extents, i.e., it uses a block
+ based addressing scheme. Since returning an extent for each block back to
+ userspace would be highly inefficient, the kernel will try to merge most
+ adjacent blocks into 'extents'.
VFS -> File System Implementation
@@ -179,23 +182,23 @@ VFS -> File System Implementation
File systems wishing to support fiemap must implement a ->fiemap callback on
their inode_operations structure. The fs ->fiemap call is responsible for
defining its set of supported fiemap flags, and calling a helper function on
-each discovered extent:
+each discovered extent::
-struct inode_operations {
+ struct inode_operations {
...
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request:
+fiemap request::
-struct fiemap_extent_info {
+ struct fiemap_extent_info {
unsigned int fi_flags; /* Flags as passed from user */
unsigned int fi_extents_mapped; /* Number of mapped extents */
unsigned int fi_extents_max; /* Size of fiemap_extent array */
struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
-};
+ };
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
@@ -203,23 +206,25 @@ EINTR once fatal signal received.
Flag checking should be done at the beginning of the ->fiemap callback via the
-fiemap_check_flags() helper:
+fiemap_prep() helper::
-int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+ int fiemap_prep(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 *len, u32 supported_flags);
The struct fieinfo should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via supported_flags. If
-fiemap_check_flags finds invalid user flags, it will place the bad values in
+fiemap_prep finds invalid user flags, it will place the bad values in
fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR from
-fiemap_check_flags(), it should immediately exit, returning that error back to
-ioctl_fiemap().
+fiemap_prep(), it should immediately exit, returning that error back to
+ioctl_fiemap(). Additionally, the range is validated against the
+supported maximum file size.
For each extent in the request range, the file system should call
-the helper function, fiemap_fill_next_extent():
+the helper function, fiemap_fill_next_extent()::
-int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
- u64 phys, u64 len, u32 flags, u32 dev);
+  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+			      u64 phys, u64 len, u32 flags);
fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
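+
+To make the flow concrete, here is a hedged sketch of a trivial
+->fiemap implementation for an imaginary filesystem that stores each
+file in one contiguous extent; myfs_extent_lookup() is an invented
+helper, not a real kernel API::
+
+	static int myfs_fiemap(struct inode *inode,
+			       struct fiemap_extent_info *fieinfo,
+			       u64 start, u64 len)
+	{
+		u64 phys, elen;
+		int ret;
+
+		/* Validate user flags and clamp the requested range. */
+		ret = fiemap_prep(inode, fieinfo, start, &len,
+				  FIEMAP_FLAG_SYNC);
+		if (ret)
+			return ret;
+
+		/* Hypothetical lookup of the extent covering 'start'. */
+		myfs_extent_lookup(inode, start, &phys, &elen);
+
+		/* Report the sole extent of this imaginary filesystem. */
+		ret = fiemap_fill_next_extent(fieinfo, start, phys, elen,
+					      FIEMAP_EXTENT_LAST);
+		return ret < 0 ? ret : 0;
+	}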
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.rst
index 46dfc6b038c3..9e38e4c221ca 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
File management in the Linux kernel
------------------------------------
+===================================
This document describes how locking for files (struct file)
and file descriptor table (struct files) works.
@@ -34,7 +37,7 @@ appear atomic. Here are the locking rules for
the fdtable structure -
1. All references to the fdtable must be done through
- the files_fdtable() macro :
+ the files_fdtable() macro::
struct fdtable *fdt;
@@ -59,54 +62,35 @@ the fdtable structure -
be held.
4. To look up the file structure given an fd, a reader
- must use either fcheck() or fcheck_files() APIs. These
+ must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.
- An example :
+
+ An example::
struct file *file;
rcu_read_lock();
- file = fcheck(fd);
- if (file) {
- ...
- }
- ....
+ file = lookup_fdget_rcu(fd);
rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
- of the fd (fget()/fget_light()) are lock-free, it is possible
- that look-up may race with the last put() operation on the
- file structure. This is avoided using atomic_long_inc_not_zero()
- on ->f_count :
-
- rcu_read_lock();
- file = fcheck_files(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
- *fput_needed = 1;
- else
- /* Didn't get the reference, someone's freed */
- file = NULL;
+ ...
+ fput(file);
}
- rcu_read_unlock();
....
- return file;
-
- atomic_long_inc_not_zero() detects if refcounts is already zero or
- goes to zero during increment. If it does, we fail
- fget()/fget_light().
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
- and fcheck()/fcheck_files() which take care of these issues.
+ and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+ issues.
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread can expand the files, thereby creating a new
fdtable and making the earlier fdtable pointer stale.
- For example :
+
+ For example::
spin_lock(&files->file_lock);
fd = locate_fd(files, file, start);
@@ -121,3 +105,19 @@ the fdtable structure -
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __files_get_rcu().
+
+In addition, it isn't possible to access or check fields in struct file
+without first acquiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU, callers must either first
+acquire a reference or hold files->file_lock of the fdtable.
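+
+A condensed sketch of that verify-after-increment pattern follows
+(illustrative only: the authoritative versions live in get_file_rcu()
+and __files_get_rcu(), and ``fdt`` is assumed to have come from
+files_fdtable())::
+
+	struct file *file;
+
+	rcu_read_lock();
+	file = rcu_dereference(fdt->fd[fd]);
+	if (file && atomic_long_inc_not_zero(&file->f_count)) {
+		/* The slot may have been recycled and reused meanwhile. */
+		if (file != rcu_dereference(fdt->fd[fd])) {
+			fput(file);	/* pinned a recycled file; drop it */
+			file = NULL;
+		}
+	} else {
+		file = NULL;
+	}
+	rcu_read_unlock();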
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index aa072112cfff..e86b886b64d0 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -31,15 +31,15 @@ However, except for filenames, fscrypt does not encrypt filesystem
metadata.
Unlike eCryptfs, which is a stacked filesystem, fscrypt is integrated
-directly into supported filesystems --- currently ext4, F2FS, and
-UBIFS. This allows encrypted files to be read and written without
-caching both the decrypted and encrypted pages in the pagecache,
-thereby nearly halving the memory used and bringing it in line with
-unencrypted files. Similarly, half as many dentries and inodes are
-needed. eCryptfs also limits encrypted filenames to 143 bytes,
-causing application compatibility issues; fscrypt allows the full 255
-bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API can be
-used by unprivileged users, with no need to mount anything.
+directly into supported filesystems --- currently ext4, F2FS, UBIFS,
+and CephFS. This allows encrypted files to be read and written
+without caching both the decrypted and encrypted pages in the
+pagecache, thereby nearly halving the memory used and bringing it in
+line with unencrypted files. Similarly, half as many dentries and
+inodes are needed. eCryptfs also limits encrypted filenames to 143
+bytes, causing application compatibility issues; fscrypt allows the
+full 255 bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API
+can be used by unprivileged users, with no need to mount anything.
fscrypt does not support encrypting files in-place. Instead, it
supports marking an empty directory as encrypted. Then, after
@@ -77,11 +77,11 @@ Side-channel attacks
fscrypt is only resistant to side-channel attacks, such as timing or
electromagnetic attacks, to the extent that the underlying Linux
-Cryptographic API algorithms are. If a vulnerable algorithm is used,
-such as a table-based implementation of AES, it may be possible for an
-attacker to mount a side channel attack against the online system.
-Side channel attacks may also be mounted against applications
-consuming decrypted data.
+Cryptographic API algorithms or inline encryption hardware are. If a
+vulnerable algorithm is used, such as a table-based implementation of
+AES, it may be possible for an attacker to mount a side channel attack
+against the online system. Side channel attacks may also be mounted
+against applications consuming decrypted data.
Unauthorized file access
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -176,11 +176,11 @@ Master Keys
Each encrypted directory tree is protected by a *master key*. Master
keys can be up to 64 bytes long, and must be at least as long as the
-greater of the key length needed by the contents and filenames
-encryption modes being used. For example, if AES-256-XTS is used for
-contents encryption, the master key must be 64 bytes (512 bits). Note
-that the XTS mode is defined to require a key twice as long as that
-required by the underlying block cipher.
+greater of the security strength of the contents and filenames
+encryption modes being used. For example, if any AES-256 mode is
+used, the master key must be at least 256 bits, i.e. 32 bytes. A
+stricter requirement applies if the key is used by a v1 encryption
+policy and AES-256-XTS is used; such keys must be 64 bytes.
To "unlock" an encrypted directory tree, userspace must provide the
appropriate master key. There can be any number of master keys, each
@@ -261,9 +261,9 @@ DIRECT_KEY policies
The Adiantum encryption mode (see `Encryption modes and usage`_) is
suitable for both contents and filenames encryption, and it accepts
-long IVs --- long enough to hold both an 8-byte logical block number
-and a 16-byte per-file nonce. Also, the overhead of each Adiantum key
-is greater than that of an AES-256-XTS key.
+long IVs --- long enough to hold both an 8-byte data unit index and a
+16-byte per-file nonce. Also, the overhead of each Adiantum key is
+greater than that of an AES-256-XTS key.
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
@@ -292,8 +292,22 @@ files' data differently, inode numbers are included in the IVs.
Consequently, shrinking the filesystem may not be allowed.
This format is optimized for use with inline encryption hardware
-compliant with the UFS or eMMC standards, which support only 64 IV
-bits per I/O request and may have only a small number of keyslots.
+compliant with the UFS standard, which supports only 64 IV bits per
+I/O request and may have only a small number of keyslots.
+
+IV_INO_LBLK_32 policies
+-----------------------
+
+IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
+IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
+SipHash key is derived from the master key) and added to the file data
+unit index mod 2^32 to produce a 32-bit IV.
+
+This format is optimized for use with inline encryption hardware
+compliant with the eMMC v5.2 standard, which supports only 32 IV bits
+per I/O request and may have only a small number of keyslots. This
+format results in some level of IV reuse, so it should only be used
+when necessary due to hardware limitations.
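+
+Expressed as a C sketch (key derivation omitted; ``ino_hash_key`` and
+``data_unit_index`` are stand-ins for values the kernel derives and
+tracks internally)::
+
+	/* SipHash-2-4 of the inode number, plus the data unit index */
+	u32 iv = (u32)(siphash_1u64(inode->i_ino, &ino_hash_key) +
+		       data_unit_index);
+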
Key identifiers
---------------
@@ -318,60 +332,181 @@ Encryption modes and usage
fscrypt allows one encryption mode to be specified for file contents
and one encryption mode to be specified for filenames. Different
directory trees are permitted to use different encryption modes.
+
+Supported modes
+---------------
+
Currently, the following pairs of encryption modes are supported:
- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
-- AES-128-CBC for contents and AES-128-CTS-CBC for filenames
+- AES-256-XTS for contents and AES-256-HCTR2 for filenames
- Adiantum for both contents and filenames
-
-If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair.
-
-AES-128-CBC was added only for low-powered embedded devices with
-crypto accelerators such as CAAM or CESA that do not support XTS. To
-use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or
-another SHA-256 implementation) must be enabled so that ESSIV can be
-used.
-
-Adiantum is a (primarily) stream cipher-based mode that is fast even
-on CPUs without dedicated crypto instructions. It's also a true
-wide-block mode, unlike XTS. It can also eliminate the need to derive
-per-file encryption keys. However, it depends on the security of two
-primitives, XChaCha12 and AES-256, rather than just one. See the
-paper "Adiantum: length-preserving encryption for entry-level
-processors" (https://eprint.iacr.org/2018/720.pdf) for more details.
-To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
-implementations of ChaCha and NHPoly1305 should be enabled, e.g.
-CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
-
-New encryption modes can be added relatively easily, without changes
-to individual filesystems. However, authenticated encryption (AE)
-modes are not currently supported because of the difficulty of dealing
-with ciphertext expansion.
+- AES-128-CBC-ESSIV for contents and AES-128-CTS-CBC for filenames
+- SM4-XTS for contents and SM4-CTS-CBC for filenames
+
+Authenticated encryption modes are not currently supported because of
+the difficulty of dealing with ciphertext expansion. Therefore,
+contents encryption uses a block cipher in `XTS mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS>`_ or
+`CBC-ESSIV mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_,
+or a wide-block cipher. Filenames encryption uses a
+block cipher in `CTS-CBC mode
+<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block
+cipher.
+
+The (AES-256-XTS, AES-256-CTS-CBC) pair is the recommended default.
+It is also the only option that is *guaranteed* to always be supported
+if the kernel supports fscrypt at all; see `Kernel config options`_.
+
+The (AES-256-XTS, AES-256-HCTR2) pair is also a good choice that
+upgrades the filenames encryption to use a wide-block cipher. (A
+*wide-block cipher*, also called a tweakable super-pseudorandom
+permutation, has the property that changing one bit scrambles the
+entire result.) As described in `Filenames encryption`_, a wide-block
+cipher is the ideal mode for the problem domain, though CTS-CBC is the
+"least bad" choice among the alternatives. For more information about
+HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_.
+
+Adiantum is recommended on systems where AES is too slow due to lack
+of hardware acceleration for AES. Adiantum is a wide-block cipher
+that uses XChaCha12 and AES-256 as its underlying components. Most of
+the work is done by XChaCha12, which is much faster than AES when AES
+acceleration is unavailable. For more information about Adiantum, see
+`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
+
+The (AES-128-CBC-ESSIV, AES-128-CTS-CBC) pair exists only to support
+systems whose only form of AES acceleration is an off-CPU crypto
+accelerator such as CAAM or CESA that does not support XTS.
+
+The remaining mode pairs are the "national pride ciphers":
+
+- (SM4-XTS, SM4-CTS-CBC)
+
+Generally speaking, these ciphers aren't "bad" per se, but they
+receive limited security review compared to the usual choices such as
+AES and ChaCha. They also don't bring much new to the table. It is
+suggested to only use these ciphers where their use is mandated.
+
+Kernel config options
+---------------------
+
+Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in
+only the basic support from the crypto API needed to use AES-256-XTS
+and AES-256-CTS-CBC encryption. For optimal performance, it is
+strongly recommended to also enable any available platform-specific
+kconfig options that provide acceleration for the algorithm(s) you
+wish to use. Support for any "non-default" encryption modes typically
+requires extra kconfig options as well.
+
+Below, some relevant options are listed by encryption mode. Note,
+acceleration options not listed below may be available for your
+platform; refer to the kconfig menus. File contents encryption can
+also be configured to use inline encryption hardware instead of the
+kernel crypto API (see `Inline encryption support`_); in that case,
+the file contents mode doesn't need to be supported in the kernel crypto
+API, but the filenames mode still does.
+
+- AES-256-XTS and AES-256-CTS-CBC
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+
+- AES-256-HCTR2
+ - Mandatory:
+ - CONFIG_CRYPTO_HCTR2
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - arm64: CONFIG_CRYPTO_POLYVAL_ARM64_CE
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+ - x86: CONFIG_CRYPTO_POLYVAL_CLMUL_NI
+
+- Adiantum
+ - Mandatory:
+ - CONFIG_CRYPTO_ADIANTUM
+ - Recommended:
+ - arm32: CONFIG_CRYPTO_CHACHA20_NEON
+ - arm32: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - arm64: CONFIG_CRYPTO_CHACHA20_NEON
+ - arm64: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - x86: CONFIG_CRYPTO_CHACHA20_X86_64
+ - x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
+ - x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
+
+- AES-128-CBC-ESSIV and AES-128-CTS-CBC:
+ - Mandatory:
+ - CONFIG_CRYPTO_ESSIV
+ - CONFIG_CRYPTO_SHA256 or another SHA-256 implementation
+ - Recommended:
+ - AES-CBC acceleration
+
+fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512
+acceleration is recommended:
+
+- SHA-512
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE
+ - x86: CONFIG_CRYPTO_SHA512_SSSE3
Contents encryption
-------------------
-For file contents, each filesystem block is encrypted independently.
-Starting from Linux kernel 5.5, encryption of filesystems with block
-size less than system's page size is supported.
-
-Each block's IV is set to the logical block number within the file as
-a little endian number, except that:
-
-- With CBC mode encryption, ESSIV is also used. Specifically, each IV
- is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
- of the file's data encryption key.
-
-- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
- Currently this is only allowed with the Adiantum encryption mode.
-
-- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- (which is also limited to 32 bits) is placed in bits 32-63.
-
-Note that because file logical block numbers are included in the IVs,
-filesystems must enforce that blocks are never shifted around within
-encrypted files, e.g. via "collapse range" or "insert range".
+For contents encryption, each file's contents is divided into "data
+units". Each data unit is encrypted independently. The IV for each
+data unit incorporates the zero-based index of the data unit within
+the file. This ensures that each data unit within a file is encrypted
+differently, which is essential to prevent leaking information.
+
+Note: the encryption depending on the offset into the file means that
+operations like "collapse range" and "insert range" that rearrange the
+extent mapping of files are not supported on encrypted files.
+
+There are two cases for the sizes of the data units:
+
+* Fixed-size data units. This is how all filesystems other than UBIFS
+ work. A file's data units are all the same size; the last data unit
+ is zero-padded if needed. By default, the data unit size is equal
+ to the filesystem block size. On some filesystems, users can select
+ a sub-block data unit size via the ``log2_data_unit_size`` field of
+ the encryption policy; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+* Variable-size data units. This is what UBIFS does. Each "UBIFS
+  data node" is treated as a crypto data unit. Each one contains
+  variable-length, possibly compressed data, zero-padded to the next
+  16-byte boundary. Users cannot select a sub-block data unit size on
+  UBIFS.
+
+In the case of compression + encryption, the compressed data is
+encrypted. UBIFS compression works as described above. f2fs
+compression works a bit differently; it compresses a number of
+filesystem blocks into a smaller number of filesystem blocks.
+Therefore an f2fs-compressed file still uses fixed-size data units, and
+it is encrypted in a similar way to a file containing holes.
+
+As mentioned in `Key hierarchy`_, the default encryption setting uses
+per-file keys. In this case, the IV for each data unit is simply the
+index of the data unit in the file. However, users can select an
+encryption setting that does not use per-file keys. For these, some
+kind of file identifier is incorporated into the IVs as follows:
+
+- With `DIRECT_KEY policies`_, the data unit index is placed in bits
+ 0-63 of the IV, and the file's nonce is placed in bits 64-191.
+
+- With `IV_INO_LBLK_64 policies`_, the data unit index is placed in
+ bits 0-31 of the IV, and the file's inode number is placed in bits
+ 32-63. This setting is only allowed when data unit indices and
+ inode numbers fit in 32 bits.
+
+- With `IV_INO_LBLK_32 policies`_, the file's inode number is hashed
+ and added to the data unit index. The resulting value is truncated
+ to 32 bits and placed in bits 0-31 of the IV. This setting is only
+ allowed when data unit indices and inode numbers fit in 32 bits.
+
+The byte order of the IV is always little endian.
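+
+As a rough model, the IV buffer used for contents encryption can be
+pictured like this (a sketch loosely patterned on the kernel's
+internal representation, not a stable API)::
+
+	union fscrypt_iv_sketch {
+		struct {
+			__le64 index;	/* data unit index; IV_INO_LBLK_64
+					 * packs the inode number into
+					 * bits 32-63 */
+			u8 nonce[16];	/* DIRECT_KEY only: bits 64-191 */
+		};
+		u8 raw[32];		/* zero-padded to the full IV size */
+	};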
+
+If the user selects FSCRYPT_MODE_AES_128_CBC for the contents mode, an
+ESSIV layer is automatically included. In this case, before the IV is
+passed to AES-128-CBC, it is encrypted with AES-256 where the AES-256
+key is the SHA-256 hash of the file's contents encryption key.
Filenames encryption
--------------------
@@ -386,11 +521,11 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames
-share a common prefix at least as long as the cipher block size (16
-bytes for AES), the corresponding encrypted filenames will also share
-a common prefix. This is undesirable. Adiantum does not have this
-weakness, as it is a wide-block encryption mode.
+With CTS-CBC, the IV reuse means that when the plaintext filenames share a
+common prefix at least as long as the cipher block size (16 bytes for AES), the
+corresponding encrypted filenames will also share a common prefix. This is
+undesirable. Adiantum and HCTR2 do not have this weakness, as they are
+wide-block encryption modes.
All supported filenames encryption modes accept any plaintext length
>= 16 bytes; cipher block alignment is not required. However,
@@ -418,9 +553,9 @@ FS_IOC_SET_ENCRYPTION_POLICY
The FS_IOC_SET_ENCRYPTION_POLICY ioctl sets an encryption policy on an
empty directory or verifies that a directory or regular file already
-has the specified encryption policy. It takes in a pointer to a
-:c:type:`struct fscrypt_policy_v1` or a :c:type:`struct
-fscrypt_policy_v2`, defined as follows::
+has the specified encryption policy. It takes in a pointer to
+struct fscrypt_policy_v1 or struct fscrypt_policy_v2, defined as
+follows::
#define FSCRYPT_POLICY_V1 0
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
@@ -440,23 +575,31 @@ fscrypt_policy_v2`, defined as follows::
__u8 contents_encryption_mode;
__u8 filenames_encryption_mode;
__u8 flags;
- __u8 __reserved[4];
+ __u8 log2_data_unit_size;
+ __u8 __reserved[3];
__u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
};
This structure must be initialized as follows:
-- ``version`` must be FSCRYPT_POLICY_V1 (0) if the struct is
- :c:type:`fscrypt_policy_v1` or FSCRYPT_POLICY_V2 (2) if the struct
- is :c:type:`fscrypt_policy_v2`. (Note: we refer to the original
- policy version as "v1", though its version code is really 0.) For
- new encrypted directories, use v2 policies.
+- ``version`` must be FSCRYPT_POLICY_V1 (0) if
+ struct fscrypt_policy_v1 is used or FSCRYPT_POLICY_V2 (2) if
+ struct fscrypt_policy_v2 is used. (Note: we refer to the original
+ policy version as "v1", though its version code is really 0.)
+ For new encrypted directories, use v2 policies.
- ``contents_encryption_mode`` and ``filenames_encryption_mode`` must
be set to constants from ``<linux/fscrypt.h>`` which identify the
encryption modes to use. If unsure, use FSCRYPT_MODE_AES_256_XTS
(1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS
- (4) for ``filenames_encryption_mode``.
+ (4) for ``filenames_encryption_mode``. For details, see `Encryption
+ modes and usage`_.
+
+ v1 encryption policies only support three combinations of modes:
+ (FSCRYPT_MODE_AES_256_XTS, FSCRYPT_MODE_AES_256_CTS),
+ (FSCRYPT_MODE_AES_128_CBC, FSCRYPT_MODE_AES_128_CTS), and
+ (FSCRYPT_MODE_ADIANTUM, FSCRYPT_MODE_ADIANTUM). v2 policies support
+ all combinations documented in `Supported modes`_.
- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
@@ -465,8 +608,38 @@ This structure must be initialized as follows:
(0x3).
- FSCRYPT_POLICY_FLAG_DIRECT_KEY: See `DIRECT_KEY policies`_.
- FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64: See `IV_INO_LBLK_64
- policies`_. This is mutually exclusive with DIRECT_KEY and is not
- supported on v1 policies.
+ policies`_.
+ - FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32: See `IV_INO_LBLK_32
+ policies`_.
+
+ v1 encryption policies only support the PAD_* and DIRECT_KEY flags.
+ The other flags are only supported by v2 encryption policies.
+
+ The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
+ mutually exclusive.
+
+- ``log2_data_unit_size`` is the log2 of the data unit size in bytes,
+ or 0 to select the default data unit size. The data unit size is
+ the granularity of file contents encryption. For example, setting
+  ``log2_data_unit_size`` to 12 causes file contents to be passed to the
+ underlying encryption algorithm (such as AES-256-XTS) in 4096-byte
+ data units, each with its own IV.
+
+ Not all filesystems support setting ``log2_data_unit_size``. ext4
+ and f2fs support it since Linux v6.7. On filesystems that support
+ it, the supported nonzero values are 9 through the log2 of the
+ filesystem block size, inclusively. The default value of 0 selects
+ the filesystem block size.
+
+ The main use case for ``log2_data_unit_size`` is for selecting a
+ data unit size smaller than the filesystem block size for
+ compatibility with inline encryption hardware that only supports
+ smaller data unit sizes. ``/sys/block/$disk/queue/crypto/`` may be
+ useful for checking which data unit sizes are supported by a
+ particular system's inline encryption hardware.
+
+ Leave this field zeroed unless you are certain you need it. Using
+ an unnecessarily small data unit size reduces performance.
- For v2 encryption policies, ``__reserved`` must be zeroed.
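+
+Putting the above together, a hedged userspace sketch of setting a v2
+policy on an empty directory (``key_id`` is a placeholder for the
+identifier returned by `FS_IOC_ADD_ENCRYPTION_KEY`_, the path is a
+placeholder, and error handling is omitted)::
+
+	#include <fcntl.h>
+	#include <linux/fscrypt.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+
+	struct fscrypt_policy_v2 policy;
+	int fd = open("/mnt/dir", O_RDONLY);
+
+	memset(&policy, 0, sizeof(policy));
+	policy.version = FSCRYPT_POLICY_V2;
+	policy.contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS;
+	policy.filenames_encryption_mode = FSCRYPT_MODE_AES_256_CTS;
+	policy.flags = FSCRYPT_POLICY_FLAGS_PAD_32;
+	memcpy(policy.master_key_identifier, key_id,
+	       FSCRYPT_KEY_IDENTIFIER_SIZE);
+
+	ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &policy);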
@@ -483,9 +656,9 @@ This structure must be initialized as follows:
replaced with ``master_key_identifier``, which is longer and cannot
be arbitrarily chosen. Instead, the key must first be added using
`FS_IOC_ADD_ENCRYPTION_KEY`_. Then, the ``key_spec.u.identifier``
- the kernel returned in the :c:type:`struct fscrypt_add_key_arg` must
- be used as the ``master_key_identifier`` in the :c:type:`struct
- fscrypt_policy_v2`.
+ the kernel returned in the struct fscrypt_add_key_arg must
+ be used as the ``master_key_identifier`` in
+ struct fscrypt_policy_v2.
If the file is not yet encrypted, then FS_IOC_SET_ENCRYPTION_POLICY
verifies that the file is an empty directory. If so, the specified
@@ -565,7 +738,7 @@ FS_IOC_GET_ENCRYPTION_POLICY_EX
The FS_IOC_GET_ENCRYPTION_POLICY_EX ioctl retrieves the encryption
policy, if any, for a directory or regular file. No additional
permissions are required beyond the ability to open the file. It
-takes in a pointer to a :c:type:`struct fscrypt_get_policy_ex_arg`,
+takes in a pointer to struct fscrypt_get_policy_ex_arg,
defined as follows::
struct fscrypt_get_policy_ex_arg {
@@ -612,9 +785,8 @@ The FS_IOC_GET_ENCRYPTION_POLICY ioctl can also retrieve the
encryption policy, if any, for a directory or regular file. However,
unlike `FS_IOC_GET_ENCRYPTION_POLICY_EX`_,
FS_IOC_GET_ENCRYPTION_POLICY only supports the original policy
-version. It takes in a pointer directly to a :c:type:`struct
-fscrypt_policy_v1` rather than a :c:type:`struct
-fscrypt_get_policy_ex_arg`.
+version. It takes in a pointer directly to struct fscrypt_policy_v1
+rather than struct fscrypt_get_policy_ex_arg.
The error codes for FS_IOC_GET_ENCRYPTION_POLICY are the same as those
for FS_IOC_GET_ENCRYPTION_POLICY_EX, except that
@@ -655,8 +827,7 @@ the filesystem, making all files on the filesystem which were
encrypted using that key appear "unlocked", i.e. in plaintext form.
It can be executed on any file or directory on the target filesystem,
but using the filesystem's root directory is recommended. It takes in
-a pointer to a :c:type:`struct fscrypt_add_key_arg`, defined as
-follows::
+a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_add_key_arg {
struct fscrypt_key_specifier key_spec;
@@ -685,17 +856,16 @@ follows::
__u8 raw[];
};
-:c:type:`struct fscrypt_add_key_arg` must be zeroed, then initialized
+struct fscrypt_add_key_arg must be zeroed, then initialized
as follows:
- If the key is being added for use by v1 encryption policies, then
``key_spec.type`` must contain FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR, and
``key_spec.u.descriptor`` must contain the descriptor of the key
being added, corresponding to the value in the
- ``master_key_descriptor`` field of :c:type:`struct
- fscrypt_policy_v1`. To add this type of key, the calling process
- must have the CAP_SYS_ADMIN capability in the initial user
- namespace.
+ ``master_key_descriptor`` field of struct fscrypt_policy_v1.
+ To add this type of key, the calling process must have the
+ CAP_SYS_ADMIN capability in the initial user namespace.
Alternatively, if the key is being added for use by v2 encryption
policies, then ``key_spec.type`` must contain
@@ -712,12 +882,13 @@ as follows:
- ``key_id`` is 0 if the raw key is given directly in the ``raw``
field. Otherwise ``key_id`` is the ID of a Linux keyring key of
- type "fscrypt-provisioning" whose payload is a :c:type:`struct
- fscrypt_provisioning_key_payload` whose ``raw`` field contains the
- raw key and whose ``type`` field matches ``key_spec.type``. Since
- ``raw`` is variable-length, the total size of this key's payload
- must be ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the
- raw key size. The process must have Search permission on this key.
+ type "fscrypt-provisioning" whose payload is
+ struct fscrypt_provisioning_key_payload whose ``raw`` field contains
+ the raw key and whose ``type`` field matches ``key_spec.type``.
+ Since ``raw`` is variable-length, the total size of this key's
+ payload must be ``sizeof(struct fscrypt_provisioning_key_payload)``
+ plus the raw key size. The process must have Search permission on
+ this key.
Most users should leave this 0 and specify the raw key directly.
The support for specifying a Linux keyring key is intended mainly to
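+
+For the common case of adding a 64-byte raw key for v2 policies, a
+hedged userspace sketch might look as follows (``fd`` is assumed to be
+open on the filesystem's root directory, ``raw_key`` is a placeholder
+buffer, and error handling is omitted)::
+
+	#include <linux/fscrypt.h>
+	#include <stdlib.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+
+	struct fscrypt_add_key_arg *arg;
+
+	arg = calloc(1, sizeof(*arg) + 64);
+	arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+	arg->raw_size = 64;
+	memcpy(arg->raw, raw_key, 64);
+
+	ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg);
+	/* arg->key_spec.u.identifier now contains the key identifier */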
@@ -835,8 +1006,8 @@ The FS_IOC_REMOVE_ENCRYPTION_KEY ioctl removes a claim to a master
encryption key from the filesystem, and possibly removes the key
itself. It can be executed on any file or directory on the target
filesystem, but using the filesystem's root directory is recommended.
-It takes in a pointer to a :c:type:`struct fscrypt_remove_key_arg`,
-defined as follows::
+It takes in a pointer to struct fscrypt_remove_key_arg, defined
+as follows::
struct fscrypt_remove_key_arg {
struct fscrypt_key_specifier key_spec;
@@ -931,8 +1102,8 @@ FS_IOC_GET_ENCRYPTION_KEY_STATUS
The FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl retrieves the status of a
master encryption key. It can be executed on any file or directory on
the target filesystem, but using the filesystem's root directory is
-recommended. It takes in a pointer to a :c:type:`struct
-fscrypt_get_key_status_arg`, defined as follows::
+recommended. It takes in a pointer to
+struct fscrypt_get_key_status_arg, defined as follows::
struct fscrypt_get_key_status_arg {
/* input */
@@ -963,8 +1134,8 @@ The caller must zero all input fields, then fill in ``key_spec``:
On success, 0 is returned and the kernel fills in the output fields:
- ``status`` indicates whether the key is absent, present, or
- incompletely removed. Incompletely removed means that the master
- secret has been removed, but some files are still in use; i.e.,
+ incompletely removed. Incompletely removed means that removal has
+ been initiated, but some files are still in use; i.e.,
`FS_IOC_REMOVE_ENCRYPTION_KEY`_ returned 0 but set the informational
status flag FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY.
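+
+A hedged sketch of querying a key's status (``fd`` and ``key_id`` are
+assumed to be an open file descriptor on the filesystem and a
+previously obtained key identifier; error handling is omitted)::
+
+	#include <linux/fscrypt.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+
+	struct fscrypt_get_key_status_arg arg;
+
+	memset(&arg, 0, sizeof(arg));
+	arg.key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+	memcpy(arg.key_spec.u.identifier, key_id,
+	       FSCRYPT_KEY_IDENTIFIER_SIZE);
+
+	ioctl(fd, FS_IOC_GET_ENCRYPTION_KEY_STATUS, &arg);
+	/* arg.status is now e.g. FSCRYPT_KEY_STATUS_PRESENT */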
@@ -1024,8 +1195,8 @@ astute users may notice some differences in behavior:
may be used to overwrite the source files but isn't guaranteed to be
effective on all filesystems and storage devices.
-- Direct I/O is not supported on encrypted files. Attempts to use
- direct I/O on such files will fall back to buffered I/O.
+- Direct I/O is supported on encrypted files only under some
+ circumstances. For details, see `Direct I/O support`_.
- The fallocate operations FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE are not supported on encrypted files and will
@@ -1040,11 +1211,6 @@ astute users may notice some differences in behavior:
- DAX (Direct Access) is not supported on encrypted files.
-- The st_size of an encrypted symlink will not necessarily give the
- length of the symlink target as required by POSIX. It will actually
- give the length of the ciphertext, which will be slightly longer
- than the plaintext due to NUL-padding and an extra 2-byte overhead.
-
- The maximum length of an encrypted symlink is 2 bytes shorter than
the maximum length of an unencrypted symlink. For example, on an
EXT4 filesystem with a 4K block size, unencrypted symlinks can be up
@@ -1117,23 +1283,88 @@ where applications may later write sensitive data. It is recommended
that systems implementing a form of "verified boot" take advantage of
this by validating all top-level encryption policies prior to access.
+Inline encryption support
+=========================
+
+By default, fscrypt uses the kernel crypto API for all cryptographic
+operations (other than HKDF, which fscrypt partially implements
+itself). The kernel crypto API supports hardware crypto accelerators,
+but only ones that work in the traditional way where all inputs and
+outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can
+take advantage of such hardware, but the traditional acceleration
+model isn't particularly efficient and fscrypt hasn't been optimized
+for it.
+
+Instead, many newer systems (especially mobile SoCs) have *inline
+encryption hardware* that can encrypt/decrypt data while it is on its
+way to/from the storage device. Linux supports inline encryption
+through a set of extensions to the block layer called *blk-crypto*.
+blk-crypto allows filesystems to attach encryption contexts to bios
+(I/O requests) to specify how the data will be encrypted or decrypted
+in-line. For more information about blk-crypto, see
+:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`.
+
+On supported filesystems (currently ext4 and f2fs), fscrypt can use
+blk-crypto instead of the kernel crypto API to encrypt/decrypt file
+contents. To enable this, set CONFIG_FS_ENCRYPTION_INLINE_CRYPT=y in
+the kernel configuration, and specify the "inlinecrypt" mount option
+when mounting the filesystem.
+
+Note that the "inlinecrypt" mount option just specifies to use inline
+encryption when possible; it doesn't force its use. fscrypt will
+still fall back to using the kernel crypto API on files where the
+inline encryption hardware doesn't have the needed crypto capabilities
+(e.g. support for the needed encryption algorithm and data unit size)
+and where blk-crypto-fallback is unusable. (For blk-crypto-fallback
+to be usable, it must be enabled in the kernel configuration with
+CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y.)
+
+Currently fscrypt always uses the filesystem block size (which is
+usually 4096 bytes) as the data unit size. Therefore, it can only use
+inline encryption hardware that supports that data unit size.
+
+Inline encryption doesn't affect the ciphertext or other aspects of
+the on-disk format, so users may freely switch back and forth between
+using "inlinecrypt" and not using "inlinecrypt".
+
+Direct I/O support
+==================
+
+For direct I/O on an encrypted file to work, the following conditions
+must be met (in addition to the conditions for direct I/O on an
+unencrypted file):
+
+* The file must be using inline encryption. Usually this means that
+ the filesystem must be mounted with ``-o inlinecrypt`` and inline
+ encryption hardware must be present. However, a software fallback
+ is also available. For details, see `Inline encryption support`_.
+
+* The I/O request must be fully aligned to the filesystem block size.
+ This means that the file position the I/O is targeting, the lengths
+ of all I/O segments, and the memory addresses of all I/O buffers
+ must be multiples of this value. Note that the filesystem block
+ size may be greater than the logical block size of the block device.
+
+If either of the above conditions is not met, then direct I/O on the
+encrypted file will fall back to buffered I/O.
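+
+For example, a reader that satisfies these rules on a filesystem with
+4096-byte blocks might look like this (a sketch: the path is a
+placeholder and error handling is omitted)::
+
+	#define _GNU_SOURCE	/* for O_DIRECT */
+	#include <fcntl.h>
+	#include <stdlib.h>
+	#include <unistd.h>
+
+	void *buf;
+	int fd = open("/mnt/dir/file", O_RDONLY | O_DIRECT);
+
+	/* Buffer address, I/O length, and file offset are all
+	 * multiples of the 4096-byte filesystem block size. */
+	posix_memalign(&buf, 4096, 4096);
+	pread(fd, buf, 4096, 0);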
+
Implementation details
======================
Encryption context
------------------
-An encryption policy is represented on-disk by a :c:type:`struct
-fscrypt_context_v1` or a :c:type:`struct fscrypt_context_v2`. It is
-up to individual filesystems to decide where to store it, but normally
-it would be stored in a hidden extended attribute. It should *not* be
+An encryption policy is represented on-disk by
+struct fscrypt_context_v1 or struct fscrypt_context_v2. It is up to
+individual filesystems to decide where to store it, but normally it
+would be stored in a hidden extended attribute. It should *not* be
exposed by the xattr-related system calls such as getxattr() and
setxattr() because of the special semantics of the encryption xattr.
(In particular, there would be much confusion if an encryption policy
were to be added to or removed from anything other than an empty
directory.) These structs are defined as follows::
- #define FS_KEY_DERIVATION_NONCE_SIZE 16
+ #define FSCRYPT_FILE_NONCE_SIZE 16
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
struct fscrypt_context_v1 {
@@ -1142,7 +1373,7 @@ directory.) These structs are defined as follows::
u8 filenames_encryption_mode;
u8 flags;
u8 master_key_descriptor[FSCRYPT_KEY_DESCRIPTOR_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
#define FSCRYPT_KEY_IDENTIFIER_SIZE 16
@@ -1151,9 +1382,10 @@ directory.) These structs are defined as follows::
u8 contents_encryption_mode;
u8 filenames_encryption_mode;
u8 flags;
- u8 __reserved[4];
+ u8 log2_data_unit_size;
+ u8 __reserved[3];
u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
The context structs contain the same information as the corresponding
@@ -1166,10 +1398,17 @@ keys`_ and `DIRECT_KEY policies`_.
Data path changes
-----------------
-For the read path (->readpage()) of regular files, filesystems can
+When inline encryption is used, filesystems just need to associate
+encryption contexts with bios to specify how the block layer or the
+inline encryption hardware will encrypt/decrypt the file contents.
+
+When inline encryption isn't used, filesystems must encrypt/decrypt
+the file contents themselves, as described below:
+
+For the read path (->read_folio()) of regular files, filesystems can
read the ciphertext into the page cache and decrypt it in-place. The
-page lock must be held until decryption has finished, to prevent the
-page from becoming visible to userspace prematurely.
+folio lock must be held until decryption has finished, to prevent the
+folio from becoming visible to userspace prematurely.
For the write path (->writepage()) of regular files, filesystems
cannot encrypt data in-place in the page cache, since the cached
@@ -1200,20 +1439,20 @@ the user-supplied name to get the ciphertext.
Lookups without the key are more complicated. The raw ciphertext may
contain the ``\0`` and ``/`` characters, which are illegal in
-filenames. Therefore, readdir() must base64-encode the ciphertext for
-presentation. For most filenames, this works fine; on ->lookup(), the
-filesystem just base64-decodes the user-supplied name to get back to
-the raw ciphertext.
+filenames. Therefore, readdir() must base64url-encode the ciphertext
+for presentation. For most filenames, this works fine; on ->lookup(),
+the filesystem just base64url-decodes the user-supplied name to get
+back to the raw ciphertext.
-However, for very long filenames, base64 encoding would cause the
+However, for very long filenames, base64url encoding would cause the
filename length to exceed NAME_MAX. To prevent this, readdir()
actually presents long filenames in an abbreviated form which encodes
a strong "hash" of the ciphertext filename, along with the optional
filesystem-specific hash(es) needed for directory lookups. This
allows the filesystem to still, with a high degree of confidence, map
the filename given in ->lookup() back to a particular directory entry
-that was previously listed by readdir(). See :c:type:`struct
-fscrypt_nokey_name` in the source for more details.
+that was previously listed by readdir(). See
+struct fscrypt_nokey_name in the source for more details.
Note that the precise way that filenames are presented to userspace
without the key is subject to change in the future. It is only meant
@@ -1225,11 +1464,14 @@ Tests
To test fscrypt, use xfstests, which is Linux's de facto standard
filesystem test suite. First, run all the tests in the "encrypt"
-group on the relevant filesystem(s). For example, to test ext4 and
+group on the relevant filesystem(s). One can also run the tests
+with the 'inlinecrypt' mount option to test the implementation for
+inline encryption support. For example, to test ext4 and
f2fs encryption using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
kvm-xfstests -c ext4,f2fs -g encrypt
+ kvm-xfstests -c ext4,f2fs -g encrypt -m inlinecrypt
UBIFS encryption can also be tested this way, but it should be done in
a separate command, and it takes some time for kvm-xfstests to set up
@@ -1251,6 +1493,7 @@ This tests the encrypted I/O paths more thoroughly. To do this with
kvm-xfstests, use the "encrypt" filesystem configuration::
kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
Because this runs many more tests than "-g encrypt" does, it takes
much longer to run; so also consider using `gce-xfstests
@@ -1258,3 +1501,4 @@ much longer to run; so also consider using `gce-xfstests
instead of kvm-xfstests::
gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index a95536b6443c..13e4b18e5dbb 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -11,9 +11,9 @@ Introduction
fs-verity (``fs/verity/``) is a support layer that filesystems can
hook into to support transparent integrity and authenticity protection
-of read-only files. Currently, it is supported by the ext4 and f2fs
-filesystems. Like fscrypt, not too much filesystem-specific code is
-needed to support fs-verity.
+of read-only files. Currently, it is supported by the ext4, f2fs, and
+btrfs filesystems. Like fscrypt, not too much filesystem-specific
+code is needed to support fs-verity.
fs-verity is similar to `dm-verity
<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
@@ -27,9 +27,9 @@ automatically verified against the file's Merkle tree. Reads of any
corrupted data, including mmap reads, will fail.
Userspace can use another ioctl to retrieve the root hash (actually
-the "file measurement", which is a hash that includes the root hash)
-that fs-verity is enforcing for the file. This ioctl executes in
-constant time, regardless of the file size.
+the "fs-verity file digest", which is a hash that includes the Merkle
+tree root hash) that fs-verity is enforcing for the file. This ioctl
+executes in constant time, regardless of the file size.
fs-verity is essentially a way to hash a file in constant time,
subject to the caveat that reads which would violate the hash will
@@ -38,20 +38,14 @@ fail at runtime.
Use cases
=========
-By itself, the base fs-verity feature only provides integrity
-protection, i.e. detection of accidental (non-malicious) corruption.
+By itself, fs-verity only provides integrity protection, i.e.
+detection of accidental (non-malicious) corruption.
However, because fs-verity makes retrieving the file hash extremely
efficient, it's primarily meant to be used as a tool to support
authentication (detection of malicious modifications) or auditing
(logging file hashes before use).
-Trusted userspace code (e.g. operating system code running on a
-read-only partition that is itself authenticated by dm-verity) can
-authenticate the contents of an fs-verity file by using the
-`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
-digital signature of it.
-
A standard file hash could be used instead of fs-verity. However,
this is inefficient if the file is large and only a small portion may
be accessed. This is often the case for Android application package
@@ -69,13 +63,31 @@ still be used on read-only filesystems. fs-verity is for files that
must live on a read-write filesystem because they are independently
updated and potentially user-installed, so dm-verity cannot be used.
-The base fs-verity feature is a hashing mechanism only; actually
-authenticating the files is up to userspace. However, to meet some
-users' needs, fs-verity optionally supports a simple signature
-verification mechanism where users can configure the kernel to require
-that all fs-verity files be signed by a key loaded into a keyring; see
-`Built-in signature verification`_. Support for fs-verity file hashes
-in IMA (Integrity Measurement Architecture) policies is also planned.
+fs-verity does not mandate a particular scheme for authenticating its
+file hashes. (Similarly, dm-verity does not mandate a particular
+scheme for authenticating its block device root hashes.) Options for
+authenticating fs-verity file hashes include:
+
+- Trusted userspace code. Often, the userspace code that accesses
+ files can be trusted to authenticate them. Consider e.g. an
+ application that wants to authenticate data files before using them,
+ or an application loader that is part of the operating system (which
+ is already authenticated in a different way, such as by being loaded
+ from a read-only partition that uses dm-verity) and that wants to
+ authenticate applications before loading them. In these cases, this
+ trusted userspace code can authenticate a file's contents by
+ retrieving its fs-verity digest using `FS_IOC_MEASURE_VERITY`_, then
+ verifying a signature of it using any userspace cryptographic
+ library that supports digital signatures.
+
+- Integrity Measurement Architecture (IMA). IMA supports fs-verity
+ file digests as an alternative to its traditional full file digests.
+ "IMA appraisal" enforces that files contain a valid, matching
+ signature in their "security.ima" extended attribute, as controlled
+ by the IMA policy. For more information, see the IMA documentation.
+
+- Trusted userspace code in combination with `Built-in signature
+ verification`_. This approach should be used only with great care.
User API
========
@@ -84,7 +96,7 @@ FS_IOC_ENABLE_VERITY
--------------------
The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file. It takes
-in a pointer to a :c:type:`struct fsverity_enable_arg`, defined as
+in a pointer to a struct fsverity_enable_arg, defined as
follows::
struct fsverity_enable_arg {
@@ -100,29 +112,31 @@ follows::
};
This structure contains the parameters of the Merkle tree to build for
-the file, and optionally contains a signature. It must be initialized
-as follows:
+the file. It must be initialized as follows:
- ``version`` must be 1.
- ``hash_algorithm`` must be the identifier for the hash algorithm to
use for the Merkle tree, such as FS_VERITY_HASH_ALG_SHA256. See
``include/uapi/linux/fsverity.h`` for the list of possible values.
-- ``block_size`` must be the Merkle tree block size. Currently, this
- must be equal to the system page size, which is usually 4096 bytes.
- Other sizes may be supported in the future. This value is not
- necessarily the same as the filesystem block size.
+- ``block_size`` is the Merkle tree block size, in bytes. In Linux
+ v6.3 and later, this can be any power of 2 between (inclusively)
+ 1024 and the minimum of the system page size and the filesystem
+ block size. In earlier versions, the page size was the only allowed
+ value.
- ``salt_size`` is the size of the salt in bytes, or 0 if no salt is
provided. The salt is a value that is prepended to every hashed
block; it can be used to personalize the hashing for a particular
file or device. Currently the maximum salt size is 32 bytes.
- ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
provided.
-- ``sig_size`` is the size of the signature in bytes, or 0 if no
- signature is provided. Currently the signature is (somewhat
- arbitrarily) limited to 16128 bytes. See `Built-in signature
- verification`_ for more information.
-- ``sig_ptr`` is the pointer to the signature, or NULL if no
- signature is provided.
+- ``sig_size`` is the size of the builtin signature in bytes, or 0 if no
+ builtin signature is provided. Currently the builtin signature is
+ (somewhat arbitrarily) limited to 16128 bytes.
+- ``sig_ptr`` is the pointer to the builtin signature, or NULL if no
+ builtin signature is provided. A builtin signature is only needed
+ if the `Built-in signature verification`_ feature is being used. It
+ is not needed for IMA appraisal, and it is not needed if the file
+ signature is being handled entirely in userspace.
- All reserved fields must be zeroed.
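+
+Putting those fields together, a hedged userspace sketch of enabling
+verity on a file (the path is a placeholder and error handling is
+omitted)::
+
+	#include <fcntl.h>
+	#include <linux/fsverity.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+
+	struct fsverity_enable_arg arg;
+	int fd = open("/mnt/file", O_RDONLY);
+
+	memset(&arg, 0, sizeof(arg));
+	arg.version = 1;
+	arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
+	arg.block_size = 4096;
+
+	ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
+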
FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
@@ -146,19 +160,20 @@ fatal signal), no changes are made to the file.
FS_IOC_ENABLE_VERITY can fail with the following errors:
- ``EACCES``: the process does not have write access to the file
-- ``EBADMSG``: the signature is malformed
+- ``EBADMSG``: the builtin signature is malformed
- ``EBUSY``: this ioctl is already running on the file
- ``EEXIST``: the file already has verity enabled
- ``EFAULT``: the caller provided inaccessible memory
+- ``EFBIG``: the file is too large to enable verity on
- ``EINTR``: the operation was interrupted by a fatal signal
- ``EINVAL``: unsupported version, hash algorithm, or block size; or
reserved bits are set; or the file descriptor refers to neither a
regular file nor a directory.
- ``EISDIR``: the file descriptor refers to a directory
-- ``EKEYREJECTED``: the signature doesn't match the file
-- ``EMSGSIZE``: the salt or signature is too long
-- ``ENOKEY``: the fs-verity keyring doesn't contain the certificate
- needed to verify the signature
+- ``EKEYREJECTED``: the builtin signature doesn't match the file
+- ``EMSGSIZE``: the salt or builtin signature is too long
+- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
+ needed to verify the builtin signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
available in the kernel's crypto API as currently configured (e.g.
for SHA-512, missing CONFIG_CRYPTO_SHA512).
@@ -167,8 +182,8 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
support; or the filesystem superblock has not had the 'verity'
feature enabled on it; or the filesystem does not support fs-verity
on this file. (See `Filesystem support`_.)
-- ``EPERM``: the file is append-only; or, a signature is required and
- one was not provided.
+- ``EPERM``: the file is append-only; or, a builtin signature is
+ required and one was not provided.
- ``EROFS``: the filesystem is read-only
- ``ETXTBSY``: someone has the file open for writing. This can be the
caller's file descriptor, another open file descriptor, or the file
@@ -177,9 +192,10 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
FS_IOC_MEASURE_VERITY
---------------------
-The FS_IOC_MEASURE_VERITY ioctl retrieves the measurement of a verity
-file. The file measurement is a digest that cryptographically
-identifies the file contents that are being enforced on reads.
+The FS_IOC_MEASURE_VERITY ioctl retrieves the digest of a verity file.
+The fs-verity file digest is a cryptographic digest that identifies
+the file contents that are being enforced on reads; it is computed via
+a Merkle tree and is different from a traditional full-file digest.
This ioctl takes in a pointer to a variable-length structure::
@@ -197,7 +213,7 @@ On success, 0 is returned and the kernel fills in the structure as
follows:
- ``digest_algorithm`` will be the hash algorithm used for the file
- measurement. It will match ``fsverity_enable_arg::hash_algorithm``.
+ digest. It will match ``fsverity_enable_arg::hash_algorithm``.
- ``digest_size`` will be the size of the digest in bytes, e.g. 32
for SHA-256. (This can be redundant with ``digest_algorithm``.)
- ``digest`` will be the actual bytes of the digest.
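+
+For illustration, a userspace sketch of calling this ioctl might look
+as follows (``print_file_digest()`` is a hypothetical helper; the
+64-byte buffer is an assumption sized for the largest currently
+supported digest, SHA-512)::
+
+    #include <linux/fsverity.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <sys/ioctl.h>
+
+    /* Illustrative only: print the digest of the verity file open on 'fd'. */
+    static int print_file_digest(int fd)
+    {
+            struct fsverity_digest *d = malloc(sizeof(*d) + 64);
+
+            if (!d)
+                    return -1;
+            d->digest_size = 64;    /* in: buffer size; out: actual size */
+            if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) != 0) {
+                    perror("FS_IOC_MEASURE_VERITY");
+                    free(d);
+                    return -1;
+            }
+            printf("algorithm %u, digest ", d->digest_algorithm);
+            for (int i = 0; i < d->digest_size; i++)
+                    printf("%02x", d->digest[i]);
+            printf("\n");
+            free(d);
+            return 0;
+    }
+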
@@ -216,6 +232,82 @@ FS_IOC_MEASURE_VERITY can fail with the following errors:
- ``EOVERFLOW``: the digest is longer than the specified
``digest_size`` bytes. Try providing a larger buffer.
+FS_IOC_READ_VERITY_METADATA
+---------------------------
+
+The FS_IOC_READ_VERITY_METADATA ioctl reads verity metadata from a
+verity file. This ioctl is available since Linux v5.12.
+
+This ioctl allows writing a server program that takes a verity file
+and serves it to a client program, such that the client can do its own
+fs-verity compatible verification of the file. This only makes sense
+if the client doesn't trust the server and if the server needs to
+provide the storage for the client.
+
+This is a fairly specialized use case, and most fs-verity users won't
+need this ioctl.
+
+This ioctl takes in a pointer to the following structure::
+
+ #define FS_VERITY_METADATA_TYPE_MERKLE_TREE 1
+ #define FS_VERITY_METADATA_TYPE_DESCRIPTOR 2
+ #define FS_VERITY_METADATA_TYPE_SIGNATURE 3
+
+ struct fsverity_read_metadata_arg {
+ __u64 metadata_type;
+ __u64 offset;
+ __u64 length;
+ __u64 buf_ptr;
+ __u64 __reserved;
+ };
+
+``metadata_type`` specifies the type of metadata to read:
+
+- ``FS_VERITY_METADATA_TYPE_MERKLE_TREE`` reads the blocks of the
+ Merkle tree. The blocks are returned in order from the root level
+ to the leaf level. Within each level, the blocks are returned in
+ the same order that their hashes are themselves hashed.
+ See `Merkle tree`_ for more information.
+
+- ``FS_VERITY_METADATA_TYPE_DESCRIPTOR`` reads the fs-verity
+ descriptor. See `fs-verity descriptor`_.
+
+- ``FS_VERITY_METADATA_TYPE_SIGNATURE`` reads the builtin signature
+ which was passed to FS_IOC_ENABLE_VERITY, if any. See `Built-in
+ signature verification`_.
+
+The semantics are similar to those of ``pread()``. ``offset``
+specifies the offset in bytes into the metadata item to read from, and
+``length`` specifies the maximum number of bytes to read from the
+metadata item. ``buf_ptr`` is the pointer to the buffer to read into,
+cast to a 64-bit integer. ``__reserved`` must be 0. On success, the
+number of bytes read is returned. 0 is returned at the end of the
+metadata item. The returned length may be less than ``length``, for
+example if the ioctl is interrupted.
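+
+For illustration, a sketch of a userspace loop that streams the whole
+Merkle tree might look like this (``dump_merkle_tree()`` is a
+hypothetical helper; error handling abbreviated)::
+
+    #include <linux/fsverity.h>
+    #include <stdint.h>
+    #include <sys/ioctl.h>
+
+    static int dump_merkle_tree(int fd)
+    {
+            struct fsverity_read_metadata_arg arg = {
+                    .metadata_type = FS_VERITY_METADATA_TYPE_MERKLE_TREE,
+            };
+            char buf[4096];
+            int ret;
+
+            arg.buf_ptr = (uintptr_t)buf;
+            arg.length = sizeof(buf);
+            for (;;) {
+                    ret = ioctl(fd, FS_IOC_READ_VERITY_METADATA, &arg);
+                    if (ret < 0)
+                            return -1;
+                    if (ret == 0)           /* end of the metadata item */
+                            return 0;
+                    /* ... consume 'ret' bytes from 'buf' ... */
+                    arg.offset += ret;
+            }
+    }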
+
+The metadata returned by FS_IOC_READ_VERITY_METADATA isn't guaranteed
+to be authenticated against the file digest that would be returned by
+`FS_IOC_MEASURE_VERITY`_, as the metadata is expected to be used to
+implement fs-verity compatible verification anyway (though absent a
+malicious disk, the metadata will indeed match). E.g. to implement
+this ioctl, the filesystem is allowed to just read the Merkle tree
+blocks from disk without actually verifying the path to the root node.
+
+FS_IOC_READ_VERITY_METADATA can fail with the following errors:
+
+- ``EFAULT``: the caller provided inaccessible memory
+- ``EINTR``: the ioctl was interrupted before any data was read
+- ``EINVAL``: reserved fields were set, or ``offset + length``
+ overflowed
+- ``ENODATA``: the file is not a verity file, or
+ FS_VERITY_METADATA_TYPE_SIGNATURE was requested but the file doesn't
+ have a builtin signature
+- ``ENOTTY``: this type of filesystem does not implement fs-verity, or
+ this ioctl is not yet implemented on it
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+ support, or the filesystem superblock has not had the 'verity'
+ feature enabled on it. (See `Filesystem support`_.)
+
FS_IOC_GETFLAGS
---------------
@@ -234,6 +326,8 @@ the file has fs-verity enabled. This can perform better than
FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
opening the file, and opening verity files can be expensive.
+.. _accessing_verity_files:
+
Accessing verity files
======================
@@ -257,25 +351,24 @@ non-verity one, with the following exceptions:
with EIO (for read()) or SIGBUS (for mmap() reads).
- If the sysctl "fs.verity.require_signatures" is set to 1 and the
- file's verity measurement is not signed by a key in the fs-verity
- keyring, then opening the file will fail. See `Built-in signature
- verification`_.
+ file is not signed by a key in the ".fs-verity" keyring, then
+ opening the file will fail. See `Built-in signature verification`_.
Direct access to the Merkle tree is not supported. Therefore, if a
verity file is copied, or is backed up and restored, then it will lose
its "verity"-ness. fs-verity is primarily meant for files like
executables that are managed by a package manager.
-File measurement computation
-============================
+File digest computation
+=======================
This section describes how fs-verity hashes the file contents using a
-Merkle tree to produce the "file measurement" which cryptographically
-identifies the file contents. This algorithm is the same for all
-filesystems that support fs-verity.
+Merkle tree to produce the digest which cryptographically identifies
+the file contents. This algorithm is the same for all filesystems
+that support fs-verity.
Userspace only needs to be aware of this algorithm if it needs to
-compute the file measurement itself, e.g. in order to sign the file.
+compute fs-verity file digests itself, e.g. in order to sign files.
.. _fsverity_merkle_tree:
@@ -325,74 +418,128 @@ can't distinguish a large file from a small second file whose data
is exactly the top-level hash block of the first file. Ambiguities
also arise from the convention of padding to the next block boundary.
-To solve this problem, the verity file measurement is actually
-computed as a hash of the following structure, which contains the
-Merkle tree root hash as well as other fields such as the file size::
+To solve this problem, the fs-verity file digest is actually computed
+as a hash of the following structure, which contains the Merkle tree
+root hash as well as other fields such as the file size::
struct fsverity_descriptor {
__u8 version; /* must be 1 */
__u8 hash_algorithm; /* Merkle tree hash algorithm */
__u8 log_blocksize; /* log2 of size of data and tree blocks */
__u8 salt_size; /* size of salt in bytes; 0 if none */
- __le32 sig_size; /* must be 0 */
+ __le32 __reserved_0x04; /* must be 0 */
__le64 data_size; /* size of file the Merkle tree is built over */
__u8 root_hash[64]; /* Merkle tree root hash */
__u8 salt[32]; /* salt prepended to each hashed block */
__u8 __reserved[144]; /* must be 0's */
};
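+
+For illustration, a hedged sketch of the digest computation for the
+SHA-256 case, using OpenSSL (the kernel uses its own crypto API; a
+SHA-512 file would hash the same 256-byte structure with SHA-512).
+This sketch assumes the descriptor itself is hashed without additional
+salting, the salt contributing only via the descriptor's ``salt``
+field and the Merkle tree root hash::
+
+    #include <linux/fsverity.h>
+    #include <openssl/sha.h>
+
+    /* Illustrative only: the fs-verity file digest is the hash of the
+     * descriptor itself, not of the bare Merkle tree root hash. */
+    static void compute_file_digest(const struct fsverity_descriptor *desc,
+                                    unsigned char out[SHA256_DIGEST_LENGTH])
+    {
+            /* All reserved fields in *desc must already be zero. */
+            SHA256((const unsigned char *)desc, sizeof(*desc), out);
+    }
+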
-Note that the ``sig_size`` field must be set to 0 for the purpose of
-computing the file measurement, even if a signature was provided (or
-will be provided) to `FS_IOC_ENABLE_VERITY`_.
-
Built-in signature verification
===============================
-With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
-a portion of an authentication policy (see `Use cases`_) in the
-kernel. Specifically, it adds support for:
+CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y adds support for in-kernel
+verification of fs-verity builtin signatures.
+
+**IMPORTANT**! Please take great care before using this feature.
+It is not the only way to do signatures with fs-verity, and the
+alternatives (such as userspace signature verification, and IMA
+appraisal) can be much better. It's also easy to fall into a trap
+of thinking this feature solves more problems than it actually does.
-1. At fs-verity module initialization time, a keyring ".fs-verity" is
- created. The root user can add trusted X.509 certificates to this
- keyring using the add_key() system call, then (when done)
- optionally use keyctl_restrict_keyring() to prevent additional
- certificates from being added.
+Enabling this option adds the following:
+
+1. At boot time, the kernel creates a keyring named ".fs-verity". The
+ root user can add trusted X.509 certificates to this keyring using
+ the add_key() system call.
2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
- detached signature in DER format of the file measurement. On
- success, this signature is persisted alongside the Merkle tree.
- Then, any time the file is opened, the kernel will verify the
- file's actual measurement against this signature, using the
- certificates in the ".fs-verity" keyring.
+ detached signature in DER format of the file's fs-verity digest.
+ On success, the ioctl persists the signature alongside the Merkle
+ tree. Then, any time the file is opened, the kernel verifies the
+ file's actual digest against this signature, using the certificates
+ in the ".fs-verity" keyring.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
- correctly signed file measurement as described in (2).
+ correctly signed digest as described in (2).
-File measurements must be signed in the following format, which is
-similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
+The signature described in (2) must be a signature of the fs-verity
+file digest in the following format::
- struct fsverity_signed_digest {
+ struct fsverity_formatted_digest {
char magic[8]; /* must be "FSVerity" */
__le16 digest_algorithm;
__le16 digest_size;
__u8 digest[];
};
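+
+For illustration, userspace signing code might assemble this blob
+roughly as follows before passing it to a PKCS#7 signing routine (a
+sketch; ``build_data_to_sign()`` is hypothetical, it assumes a UAPI
+header defining the structure, and uses ``htole16()`` from
+``<endian.h>`` for the little-endian fields)::
+
+    #include <endian.h>
+    #include <linux/fsverity.h>
+    #include <stdlib.h>
+    #include <string.h>
+
+    /* Illustrative only: 'digest' is the fs-verity file digest, e.g.
+     * from FS_IOC_MEASURE_VERITY or computed in userspace. */
+    static void *build_data_to_sign(__u16 alg, const __u8 *digest,
+                                    __u16 digest_size, size_t *size_ret)
+    {
+            struct fsverity_formatted_digest *d;
+
+            *size_ret = sizeof(*d) + digest_size;
+            d = calloc(1, *size_ret);
+            if (!d)
+                    return NULL;
+            memcpy(d->magic, "FSVerity", 8);
+            d->digest_algorithm = htole16(alg);
+            d->digest_size = htole16(digest_size);
+            memcpy(d->digest, digest, digest_size);
+            return d;
+    }
+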
-fs-verity's built-in signature verification support is meant as a
-relatively simple mechanism that can be used to provide some level of
-authenticity protection for verity files, as an alternative to doing
-the signature verification in userspace or using IMA-appraisal.
-However, with this mechanism, userspace programs still need to check
-that the verity bit is set, and there is no protection against verity
-files being swapped around.
+That's it. It should be emphasized again that fs-verity builtin
+signatures are not the only way to do signatures with fs-verity. See
+`Use cases`_ for an overview of ways in which fs-verity can be used.
+fs-verity builtin signatures have some major limitations that should
+be carefully considered before using them:
+
+- Builtin signature verification does *not* make the kernel enforce
+ that any files actually have fs-verity enabled. Thus, it is not a
+ complete authentication policy. Currently, if it is used, the only
+ way to complete the authentication policy is for trusted userspace
+ code to explicitly check whether files have fs-verity enabled with a
+ signature before they are accessed. (With
+ fs.verity.require_signatures=1, just checking whether fs-verity is
+ enabled suffices.) But, in this case the trusted userspace code
+ could just store the signature alongside the file and verify it
+ itself using a cryptographic library, instead of using this feature.
+
+- A file's builtin signature can only be set at the same time that
+ fs-verity is being enabled on the file. Changing or deleting the
+ builtin signature later requires re-creating the file.
+
+- Builtin signature verification uses the same set of public keys for
+ all fs-verity enabled files on the system. Different keys cannot be
+ trusted for different files; each key is all or nothing.
+
+- The sysctl fs.verity.require_signatures applies system-wide.
+ Setting it to 1 only works when all users of fs-verity on the system
+ agree that it should be set to 1. This limitation can prevent
+ fs-verity from being used in cases where it would be helpful.
+
+- Builtin signature verification can only use signature algorithms
+ that are supported by the kernel. For example, the kernel does not
+ yet support Ed25519, even though this is often the signature
+ algorithm that is recommended for new cryptographic designs.
+
+- fs-verity builtin signatures are in PKCS#7 format, and the public
+ keys are in X.509 format. These formats are commonly used,
+ including by some other kernel features (which is why the fs-verity
+ builtin signatures use them), and are very feature rich.
+ Unfortunately, history has shown that code that parses and handles
+ these formats (which are from the 1990s and are based on ASN.1)
+ often has vulnerabilities as a result of their complexity. This
+ complexity is not inherent to the cryptography itself.
+
+ fs-verity users who do not need advanced features of X.509 and
+ PKCS#7 should strongly consider using simpler formats, such as plain
+ Ed25519 keys and signatures, and verifying signatures in userspace.
+
+ fs-verity users who choose to use X.509 and PKCS#7 anyway should
+ still consider that verifying those signatures in userspace is more
+ flexible (for other reasons mentioned earlier in this document) and
+ eliminates the need to enable CONFIG_FS_VERITY_BUILTIN_SIGNATURES
+ and its associated increase in kernel attack surface. In some cases
+ it can even be necessary, since advanced X.509 and PKCS#7 features
+ do not always work as intended with the kernel. For example, the
+ kernel does not check X.509 certificate validity times.
+
+ Note: IMA appraisal, which supports fs-verity, does not use PKCS#7
+ for its signatures, so it partially avoids the issues discussed
+ here. IMA appraisal does use X.509.
Filesystem support
==================
-fs-verity is currently supported by the ext4 and f2fs filesystems.
-The CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
-on either filesystem.
+fs-verity is supported by several filesystems, described below. The
+CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity on
+any of these filesystems.
``include/linux/fsverity.h`` declares the interface between the
``fs/verity/`` support layer and filesystems. Briefly, filesystems
@@ -412,17 +559,19 @@ To create verity files on an ext4 filesystem, the filesystem must have
been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
it. "verity" is an RO_COMPAT filesystem feature, so once set, old
kernels will only be able to mount the filesystem readonly, and old
-versions of e2fsck will be unable to check the filesystem. Moreover,
-currently ext4 only supports mounting a filesystem with the "verity"
-feature when its block size is equal to PAGE_SIZE (often 4096 bytes).
+versions of e2fsck will be unable to check the filesystem.
+
+Originally, an ext4 filesystem with the "verity" feature could only be
+mounted when its block size was equal to the system page size
+(typically 4096 bytes). In Linux v6.3, this limitation was removed.
ext4 sets the EXT4_VERITY_FL on-disk inode flag on verity files. It
can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be cleared.
ext4 also supports encryption, which can be used simultaneously with
fs-verity. In this case, the plaintext data is verified rather than
-the ciphertext. This is necessary in order to make the file
-measurement meaningful, since every file is encrypted differently.
+the ciphertext. This is necessary in order to make the fs-verity file
+digest meaningful, since every file is encrypted differently.
ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
@@ -435,9 +584,7 @@ support paging multi-gigabyte xattrs into memory, and to support
encrypting xattrs. Note that the verity metadata *must* be encrypted
when the file is, since it contains hashes of the plaintext data.
-Currently, ext4 verity only supports the case where the Merkle tree
-block size, filesystem block size, and page size are all the same. It
-also only supports extent-based files.
+ext4 only allows verity on extent-based files.
f2fs
----
@@ -455,11 +602,17 @@ Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first
64K boundary beyond i_size. See explanation for ext4 above.
Moreover, f2fs supports at most 4096 bytes of xattr entries per inode
-which wouldn't be enough for even a single Merkle tree block.
+which usually wouldn't be enough for even a single Merkle tree block.
-Currently, f2fs verity only supports a Merkle tree block size of 4096.
-Also, f2fs doesn't support enabling verity on files that currently
-have atomic or volatile writes pending.
+f2fs doesn't support enabling verity on files that currently have
+atomic or volatile writes pending.
+
+btrfs
+-----
+
+btrfs supports fs-verity since Linux v5.15. Verity-enabled inodes are
+marked with a RO_COMPAT inode flag, and the verity metadata is stored
+in separate btree items.
Implementation details
======================
@@ -476,52 +629,49 @@ already verified). Below, we describe how filesystems implement this.
Pagecache
~~~~~~~~~
-For filesystems using Linux's pagecache, the ``->readpage()`` and
-``->readpages()`` methods must be modified to verify pages before they
-are marked Uptodate. Merely hooking ``->read_iter()`` would be
+For filesystems using Linux's pagecache, the ``->read_folio()`` and
+``->readahead()`` methods must be modified to verify folios before
+they are marked Uptodate. Merely hooking ``->read_iter()`` would be
insufficient, since ``->read_iter()`` is not used for memory maps.
-Therefore, fs/verity/ provides a function fsverity_verify_page() which
-verifies a page that has been read into the pagecache of a verity
-inode, but is still locked and not Uptodate, so it's not yet readable
-by userspace. As needed to do the verification,
-fsverity_verify_page() will call back into the filesystem to read
-Merkle tree pages via fsverity_operations::read_merkle_tree_page().
+Therefore, fs/verity/ provides the function fsverity_verify_blocks()
+which verifies data that has been read into the pagecache of a verity
+inode. The containing folio must still be locked and not Uptodate, so
+it's not yet readable by userspace. As needed to do the verification,
+fsverity_verify_blocks() will call back into the filesystem to read
+hash blocks via fsverity_operations::read_merkle_tree_page().
-fsverity_verify_page() returns false if verification failed; in this
-case, the filesystem must not set the page Uptodate. Following this,
+fsverity_verify_blocks() returns false if verification failed; in this
+case, the filesystem must not set the folio Uptodate. Following this,
as per the usual Linux pagecache behavior, attempts by userspace to
-read() from the part of the file containing the page will fail with
-EIO, and accesses to the page within a memory map will raise SIGBUS.
-
-fsverity_verify_page() currently only supports the case where the
-Merkle tree block size is equal to PAGE_SIZE (often 4096 bytes).
+read() from the part of the file containing the folio will fail with
+EIO, and accesses to the folio within a memory map will raise SIGBUS.
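+
+As a rough sketch (not modelled on any particular filesystem), the
+pattern in a simple synchronous ``->read_folio()`` implementation
+would be the following, where ``myfs_read_folio()`` is hypothetical
+and fsverity_verify_folio(), in recent kernels, is the whole-folio
+convenience wrapper around fsverity_verify_blocks()::
+
+    static int myfs_read_folio(struct file *file, struct folio *folio)
+    {
+            /* ... read the folio's data from disk here ... */
+
+            if (fsverity_active(folio->mapping->host) &&
+                !fsverity_verify_folio(folio)) {
+                    folio_unlock(folio);
+                    return -EIO;    /* deliberately not marked Uptodate */
+            }
+            folio_mark_uptodate(folio);
+            folio_unlock(folio);
+            return 0;
+    }
+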
-In principle, fsverity_verify_page() verifies the entire path in the
-Merkle tree from the data page to the root hash. However, for
-efficiency the filesystem may cache the hash pages. Therefore,
-fsverity_verify_page() only ascends the tree reading hash pages until
-an already-verified hash page is seen, as indicated by the PageChecked
-bit being set. It then verifies the path to that page.
+In principle, verifying a data block requires verifying the entire
+path in the Merkle tree from the data block to the root hash.
+However, for efficiency the filesystem may cache the hash blocks.
+Therefore, fsverity_verify_blocks() only ascends the tree reading hash
+blocks until an already-verified hash block is seen. It then verifies
+the path to that block.
This optimization, which is also used by dm-verity, results in
excellent sequential read performance. This is because usually (e.g.
-127 in 128 times for 4K blocks and SHA-256) the hash page from the
+127 in 128 times for 4K blocks and SHA-256) the hash block from the
bottom level of the tree will already be cached and checked from
-reading a previous data page. However, random reads perform worse.
+reading a previous data block. However, random reads perform worse.
Block device based filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Block device based filesystems (e.g. ext4 and f2fs) in Linux also use
the pagecache, so the above subsection applies too. However, they
-also usually read many pages from a file at once, grouped into a
+also usually read many data blocks from a file at once, grouped into a
structure called a "bio". To make it easier for these types of
filesystems to support fs-verity, fs/verity/ also provides a function
-fsverity_verify_bio() which verifies all pages in a bio.
+fsverity_verify_bio() which verifies all data blocks in a bio.
ext4 and f2fs also support encryption. If a verity file is also
-encrypted, the pages must be decrypted before being verified. To
+encrypted, the data must be decrypted before being verified. To
support this, these filesystems allocate a "post-read context" for
each bio and store it in ``->bi_private``::
@@ -536,17 +686,17 @@ each bio and store it in ``->bi_private``::
verity, or both is enabled. After the bio completes, for each needed
postprocessing step the filesystem enqueues the bio_post_read_ctx on a
workqueue, and then the workqueue work does the decryption or
-verification. Finally, pages where no decryption or verity error
-occurred are marked Uptodate, and the pages are unlocked.
+verification. Finally, folios where no decryption or verity error
+occurred are marked Uptodate, and the folios are unlocked.
-Files on ext4 and f2fs may contain holes. Normally, ``->readpages()``
-simply zeroes holes and sets the corresponding pages Uptodate; no bios
-are issued. To prevent this case from bypassing fs-verity, these
-filesystems use fsverity_verify_page() to verify hole pages.
+On many filesystems, files can contain holes. Normally,
+``->readahead()`` simply zeroes hole blocks and considers the
+corresponding data to be up-to-date; no bios are issued. To prevent
+this case from bypassing fs-verity, filesystems use
+fsverity_verify_blocks() to verify hole blocks.
-ext4 and f2fs disable direct I/O on verity files, since otherwise
-direct I/O would bypass fs-verity. (They also do the same for
-encrypted files.)
+Filesystems also disable direct I/O on verity files, since otherwise
+direct I/O would bypass fs-verity.
Userspace utility
=================
@@ -554,7 +704,7 @@ Userspace utility
This document focuses on the kernel, but a userspace utility for
fs-verity can be found at:
- https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git
+ https://git.kernel.org/pub/scm/fs/fsverity/fsverity-utils.git
See the README.md file in the fsverity-utils source tree for details,
including examples of setting up fs-verity protected files.
@@ -565,7 +715,7 @@ Tests
To test fs-verity, use xfstests. For example, using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
- kvm-xfstests -c ext4,f2fs -g verity
+ kvm-xfstests -c ext4,f2fs,btrfs -g verity
FAQ
===
@@ -581,19 +731,19 @@ weren't already directly answered in other parts of this document.
hashed and what to do with those hashes, such as log them,
authenticate them, or add them to a measurement list.
- IMA is planned to support the fs-verity hashing mechanism as an
- alternative to doing full file hashes, for people who want the
- performance and security benefits of the Merkle tree based hash.
- But it doesn't make sense to force all uses of fs-verity to be
- through IMA. As a standalone filesystem feature, fs-verity
- already meets many users' needs, and it's testable like other
+ IMA supports the fs-verity hashing mechanism as an alternative
+ to full file hashes, for those who want the performance and
+ security benefits of the Merkle tree based hash. However, it
+ doesn't make sense to force all uses of fs-verity to be through
+ IMA. fs-verity already meets many users' needs even as a
+ standalone filesystem feature, and it's testable like other
filesystem features e.g. with xfstests.
:Q: Isn't fs-verity useless because the attacker can just modify the
hashes in the Merkle tree, which is stored on-disk?
:A: To verify the authenticity of an fs-verity file you must verify
- the authenticity of the "file measurement", which is basically the
- root hash of the Merkle tree. See `Use cases`_.
+ the authenticity of the "fs-verity file digest", which
+ incorporates the root hash of the Merkle tree. See `Use cases`_.
:Q: Isn't fs-verity useless because the attacker can just replace a
verity file with a non-verity one?
@@ -659,7 +809,7 @@ weren't already directly answered in other parts of this document.
retrofit existing filesystems with new consistency mechanisms.
Data journalling is available on ext4, but is very slow.
- - Rebuilding the the Merkle tree after every write, which would be
+ - Rebuilding the Merkle tree after every write, which would be
extremely inefficient. Alternatively, a different authenticated
dictionary structure such as an "authenticated skiplist" could
be used. However, this would be far more complex.
@@ -688,25 +838,25 @@ weren't already directly answered in other parts of this document.
e.g. magically trigger construction of a Merkle tree.
:Q: Does fs-verity support remote filesystems?
-:A: Only ext4 and f2fs support is implemented currently, but in
- principle any filesystem that can store per-file verity metadata
- can support fs-verity, regardless of whether it's local or remote.
- Some filesystems may have fewer options of where to store the
- verity metadata; one possibility is to store it past the end of
- the file and "hide" it from userspace by manipulating i_size. The
- data verification functions provided by ``fs/verity/`` also assume
- that the filesystem uses the Linux pagecache, but both local and
- remote filesystems normally do so.
+:A: So far all filesystems that have implemented fs-verity support are
+ local filesystems, but in principle any filesystem that can store
+ per-file verity metadata can support fs-verity, regardless of
+ whether it's local or remote. Some filesystems may have fewer
+ options of where to store the verity metadata; one possibility is
+ to store it past the end of the file and "hide" it from userspace
+ by manipulating i_size. The data verification functions provided
+ by ``fs/verity/`` also assume that the filesystem uses the Linux
+ pagecache, but both local and remote filesystems normally do so.
:Q: Why is anything filesystem-specific at all? Shouldn't fs-verity
be implemented entirely at the VFS level?
:A: There are many reasons why this is not possible or would be very
difficult, including the following:
- - To prevent bypassing verification, pages must not be marked
+ - To prevent bypassing verification, folios must not be marked
Uptodate until they've been verified. Currently, each
- filesystem is responsible for marking pages Uptodate via
- ``->readpages()``. Therefore, currently it's not possible for
+ filesystem is responsible for marking folios Uptodate via
+ ``->readahead()``. Therefore, currently it's not possible for
the VFS to do the verification on its own. Changing this would
require significant changes to the VFS and all filesystems.
diff --git a/Documentation/filesystems/fuse-io.txt b/Documentation/filesystems/fuse-io.rst
index 07b8f73f100f..6464de4266ad 100644
--- a/Documentation/filesystems/fuse-io.txt
+++ b/Documentation/filesystems/fuse-io.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Fuse I/O Modes
+==============
+
Fuse supports the following I/O modes:
- direct-io
@@ -9,7 +15,8 @@ The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the
FUSE_OPEN reply.
In direct-io mode the page cache is completely bypassed for reads and writes.
-No read-ahead takes place. Shared mmap is disabled.
+No read-ahead takes place. Shared mmap is disabled by default. To allow shared
+mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply.
In cached mode reads may be satisfied from the page cache, and data may be
read-ahead by the kernel to fill the cache. The cache is always kept consistent
diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse.rst
index cd717f9bf940..1e31e87aee68 100644
--- a/Documentation/filesystems/fuse.rst
+++ b/Documentation/filesystems/fuse.rst
@@ -47,7 +47,7 @@ filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
The userspace library and utilities are available from the
-`FUSE homepage: <http://fuse.sourceforge.net/>`_
+`FUSE homepage: <https://github.com/libfuse/>`_
Filesystem type
===============
@@ -279,7 +279,7 @@ How are requirements fulfilled?
the filesystem or not.
Note that the *ptrace* check is not strictly necessary to
- prevent B/2/i, it is enough to check if mount owner has enough
+ prevent C/2/i, it is enough to check if the mount owner has enough
privilege to send a signal to the process accessing the
filesystem, since *SIGSTOP* can be used to get a similar effect.
@@ -288,10 +288,29 @@ I think these limitations are unacceptable?
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
-mounts, it can relax the last limitation with a 'user_allow_other'
-config option. If this config option is set, the mounting user can
-add the 'allow_other' mount option which disables the check for other
-users' processes.
+mounts, it can relax the last limitation in several ways:
+
+ - With the 'user_allow_other' config option. If this config option is
+ set, the mounting user can add the 'allow_other' mount option which
+ disables the check for other users' processes.
+
+ User namespaces have an unintuitive interaction with 'allow_other':
+ an unprivileged user - normally restricted from mounting with
+ 'allow_other' - could do so in a user namespace where they're
+ privileged. If any process could access such an 'allow_other' mount
+ this would give the mounting user the ability to manipulate
+ processes in user namespaces where they're unprivileged. For this
+ reason 'allow_other' restricts access to users in the same userns
+ or a descendant.
+
+ - With the 'allow_sys_admin_access' module option. If this option is
+ set, the super user's processes have unrestricted access to mounts,
+ irrespective of the allow_other setting or the user namespace of the
+ mounting user.
+
+Note that both of these relaxations expose the system to potential
+information leak or *DoS* as described in points B and C/2/i-ii in the
+preceding section.
Kernel - userspace interface
============================
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.rst
index 7059623635b2..8a5842929b60 100644
--- a/Documentation/filesystems/gfs2-glocks.txt
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -1,5 +1,8 @@
- Glock internal locking rules
- ------------------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Glock internal locking rules
+============================
This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
@@ -17,31 +20,34 @@ The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
-are queued, except for those marked LM_FLAG_PRIORITY which are
-used only during recovery, and even then only for journal locks.
+are queued.
There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:
-Glock mode | DLM lock mode
-------------------------------
- UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
- SH | PR (Protected read)
- DF | CW (Concurrent write)
- EX | EX (Exclusive)
+========== ====== =====================================================
+Glock mode DLM lock mode
+========== ====== =====================================================
+ UN IV/NL Unlocked (no DLM lock associated with glock) or NL
+ SH PR (Protected read)
+ DF CW (Concurrent write)
+ EX EX (Exclusive)
+========== ====== =====================================================
Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
---------------------------------------------------------------------------
- UN | No | No | No | No
- SH | Yes | Yes | No | No
- DF | No | Yes | No | No
- EX | Yes | Yes | Yes | Yes
+========== ========== ============== ========== ==============
+Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
+========== ========== ============== ========== ==============
+ UN No No No No
+ SH Yes Yes No No
+ DF No Yes No No
+ EX Yes Yes Yes Yes
+========== ========== ============== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -49,27 +55,29 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-Field | Purpose
-----------------------------------------------------------------------------
-go_xmote_th | Called before remote state change (e.g. to sync dirty data)
-go_xmote_bh | Called after remote state change (e.g. to refill cache)
-go_inval | Called if remote state change requires invalidating the cache
-go_demote_ok | Returns boolean value of whether its ok to demote a glock
- | (e.g. checks timeout, and that there is no cached data)
-go_lock | Called for the first local holder of a lock
-go_unlock | Called on the final local unlock of a lock
-go_dump | Called to print content of object for debugfs file, or on
- | error to dump glock to the log.
-go_type | The type of the glock, LM_TYPE_.....
-go_callback | Called if the DLM sends a callback to drop this lock
-go_flags | GLOF_ASPACE is set, if the glock has an address space
- | associated with it
+============= =============================================================
+Field Purpose
+============= =============================================================
+go_xmote_th Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh Called after remote state change (e.g. to refill cache)
+go_inval Called if remote state change requires invalidating the cache
+go_demote_ok Returns a boolean value of whether it's ok to demote a glock
+ (e.g. checks timeout, and that there is no cached data)
+go_lock Called for the first local holder of a lock
+go_unlock Called on the final local unlock of a lock
+go_dump Called to print content of object for debugfs file, or on
+ error to dump glock to the log.
+go_type The type of the glock, ``LM_TYPE_*``
+go_callback Called if the DLM sends a callback to drop this lock
+go_flags GLOF_ASPACE is set if the glock has an address space
+ associated with it
+============= =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
-tends to show up most with shared mmaped files which are being written
+tends to show up most with shared mmapped files which are being written
to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
@@ -82,21 +90,25 @@ rather than via the glock.
Locking rules for glock operations:
-Operation | GLF_LOCK bit lock held | gl_lockref.lock spinlock held
--------------------------------------------------------------------------
-go_xmote_th | Yes | No
-go_xmote_bh | Yes | No
-go_inval | Yes | No
-go_demote_ok | Sometimes | Yes
-go_lock | Yes | No
-go_unlock | Yes | No
-go_dump | Sometimes | Yes
-go_callback | Sometimes (N/A) | Yes
-
-N.B. Operations must not drop either the bit lock or the spinlock
-if its held on entry. go_dump and do_demote_ok must never block.
-Note that go_dump will only be called if the glock's state
-indicates that it is caching uptodate data.
+============= ====================== =============================
+Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
+============= ====================== =============================
+go_xmote_th Yes No
+go_xmote_bh Yes No
+go_inval Yes No
+go_demote_ok Sometimes Yes
+go_lock Yes No
+go_unlock Yes No
+go_dump Sometimes Yes
+go_callback Sometimes (N/A) Yes
+============= ====================== =============================
+
+.. Note::
+
+ Operations must not drop either the bit lock or the spinlock
+ if it's held on entry. go_dump and go_demote_ok must never block.
+ Note that go_dump will only be called if the glock's state
+ indicates that it is caching uptodate data.
Glock locking order within GFS2:
@@ -104,7 +116,7 @@ Glock locking order within GFS2:
2. Rename glock (for rename only)
3. Inode glock(s)
(Parents before children, inodes at "same level" with same parent in
- lock number order)
+ lock number order)
4. Rgrp glock(s) (for (de)allocation operations)
5. Transaction glock (via gfs2_trans_begin) for non-read operations
6. i_rw_mutex (if required)
@@ -117,8 +129,8 @@ determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.
- Glock Statistics
- ------------------
+Glock Statistics
+----------------
The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
@@ -173,8 +185,8 @@ we'd like to get a better idea of these timings:
1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
-allocation (to base it on lock wait time, rather than blindly
-using a "try lock")
+ allocation (to base it on lock wait time, rather than blindly
+ using a "try lock")
Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
@@ -195,10 +207,13 @@ as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.
-Per sb stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/sbstats
-Per glock stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/glstats
+Per sb stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/sbstats
+
+Per glock stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/glstats
Assuming that debugfs is mounted on /sys/kernel/debug and also
that <fsname> is replaced with the name of the gfs2 filesystem
@@ -206,14 +221,16 @@ in question.
The abbreviations used in the output are as follows:
-srtt - Smoothed round trip time for non-blocking dlm requests
-srttvar - Variance estimate for srtt
-srttb - Smoothed round trip time for (potentially) blocking dlm requests
-srttvarb - Variance estimate for srttb
-sirt - Smoothed inter-request time (for dlm requests)
-sirtvar - Variance estimate for sirt
-dlm - Number of dlm requests made (dcnt in glstats file)
-queue - Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
+srtt Smoothed round trip time for non-blocking dlm requests
+srttvar Variance estimate for srtt
+srttb Smoothed round trip time for (potentially) blocking dlm requests
+srttvarb Variance estimate for srttb
+sirt Smoothed inter-request time (for dlm requests)
+sirtvar Variance estimate for sirt
+dlm Number of dlm requests made (dcnt in glstats file)
+queue Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
@@ -224,9 +241,12 @@ The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:
-status - The status of the dlm request
-flags - The dlm request flags
-tdiff - The time taken by this specific request
+====== =======================================
+status The status of the dlm request
+flags The dlm request flags
+tdiff The time taken by this specific request
+====== =======================================
+
(remaining fields as per above list)
diff --git a/Documentation/filesystems/gfs2.rst b/Documentation/filesystems/gfs2.rst
index 8d1ab589ce18..1bc48a13430c 100644
--- a/Documentation/filesystems/gfs2.rst
+++ b/Documentation/filesystems/gfs2.rst
@@ -1,53 +1,52 @@
.. SPDX-License-Identifier: GPL-2.0
-==================
-Global File System
-==================
+====================
+Global File System 2
+====================
-https://fedorahosted.org/cluster/wiki/HomePage
-
-GFS is a cluster file system. It allows a cluster of computers to
+GFS2 is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
-iSCSI, NBD, etc). GFS reads and writes to the block device like a local
+iSCSI, NBD, etc). GFS2 reads and writes to the block device like a local
file system, but also uses a lock module to allow the computers to coordinate
their I/O so file system consistency is maintained. One of the nifty
-features of GFS is perfect consistency -- changes made to the file system
+features of GFS2 is perfect consistency -- changes made to the file system
on one machine show up immediately on all other machines in the cluster.
-GFS uses interchangeable inter-node locking mechanisms, the currently
+GFS2 uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are:
lock_nolock
- - allows gfs to be used as a local file system
+ - allows GFS2 to be used as a local file system
lock_dlm
- - uses a distributed lock manager (dlm) for inter-node locking.
+ - uses the distributed lock manager (dlm) for inter-node locking.
The dlm is found at linux/fs/dlm/
-Lock_dlm depends on user space cluster management systems found
+lock_dlm depends on user space cluster management systems found
at the URL above.
-To use gfs as a local file system, no external clustering systems are
+To use GFS2 as a local file system, no external clustering systems are
needed, simply::
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
-If you are using Fedora, you need to install the gfs2-utils package
-and, for lock_dlm, you will also need to install the cman package
-and write a cluster.conf as per the documentation. For F17 and above
-cman has been replaced by the dlm package.
+The gfs2-utils package is required on all cluster nodes and, for lock_dlm, you
+will also need the dlm and corosync user space utilities configured as per the
+documentation.
+
+gfs2-utils can be found at https://pagure.io/gfs2-utils
GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close.
-The following man pages can be found at the URL above:
+The following man pages are available from gfs2-utils:
============ =============================================
fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem
- gfs2_convert to convert a gfs filesystem to gfs2 in-place
+ gfs2_convert to convert a gfs filesystem to GFS2 in-place
mkfs.gfs2 to make a filesystem
============ =============================================
diff --git a/Documentation/filesystems/hfs.rst b/Documentation/filesystems/hfs.rst
index ab17a005e9b1..776015c80e3f 100644
--- a/Documentation/filesystems/hfs.rst
+++ b/Documentation/filesystems/hfs.rst
@@ -76,7 +76,7 @@ Creating HFS filesystems
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystems. See
-<http://www.mars.org/home/rob/proj/hfs/> for details.
+<https://www.mars.org/home/rob/proj/hfs/> for details.
Credits
diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst
index 0db152278572..7e0dd2f4373e 100644
--- a/Documentation/filesystems/hpfs.rst
+++ b/Documentation/filesystems/hpfs.rst
@@ -7,7 +7,7 @@ Read/Write HPFS 2.09
1998-2004, Mikulas Patocka
:email: mikulas@artax.karlin.mff.cuni.cz
-:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+:homepage: https://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
Credits
=======
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
new file mode 100644
index 000000000000..ac0af679e61e
--- /dev/null
+++ b/Documentation/filesystems/idmappings.rst
@@ -0,0 +1,1039 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Idmappings
+==========
+
+Most filesystem developers will have encountered idmappings. They are used when
+reading from or writing ownership to disk, reporting ownership to userspace, or
+for permission checking. This document is aimed at filesystem developers that
+want to know how idmappings work.
+
+Formal notes
+------------
+
+An idmapping is essentially a translation of a range of ids into another or the
+same range of ids. The notational convention for idmappings that is widely used
+in userspace is::
+
+ u:k:r
+
+``u`` indicates the first element in the upper idmapset ``U`` and ``k``
+indicates the first element in the lower idmapset ``K``. The ``r`` parameter
+indicates the range of the idmapping, i.e. how many ids are mapped. From now
+on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
+we're talking about an id in the upper or lower idmapset.
+
+To see what this looks like in practice, let's take the following idmapping::
+
+ u22:k10000:r3
+
+and write down the mappings it will generate::
+
+ u22 -> k10000
+ u23 -> k10001
+ u24 -> k10002
+
+From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
+idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
+order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
+the set of all possible ids usable on a given system.
+
+Looking at this mathematically briefly will help us highlight some properties
+that make it easier to understand how we can translate between idmappings. For
+example, we know that the inverse idmapping is an order isomorphism as well::
+
+ k10000 -> u22
+ k10001 -> u23
+ k10002 -> u24
+
+Given that we are dealing with order isomorphisms plus the fact that we're
+dealing with subsets we can embed idmappings into each other, i.e. we can
+sensibly translate between different idmappings. For example, assume we've been
+given the three idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r10000
+ 3. u0:k30000:r10000
+
+and id ``k11000`` which has been generated by the first idmapping by mapping
+``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
+
+Because we're dealing with order isomorphic subsets it is meaningful to ask
+what id ``k11000`` corresponds to in the second or third idmapping. The
+straightforward algorithm to use is to apply the inverse of the first idmapping,
+mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
+either the second or the third idmapping. The second idmapping would map
+``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down
+to ``k31000``.
+
+If we were given the same task for the following three idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r200
+ 3. u0:k30000:r300
+
+we would fail to translate as the sets aren't order isomorphic over the full
+range of the first idmapping anymore (however, they are order isomorphic over
+the full range of the second idmapping). Neither the second nor the third
+idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent
+to not having an id mapped. We can simply say that ``u1000`` is unmapped in
+the second and third idmapping.
+``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
+
+The algorithm to calculate what a given id maps to is pretty simple. First, we
+need to verify that the range can contain our target id. We will skip this step
+for simplicity. After that if we want to know what ``id`` maps to we can do
+simple calculations:
+
+- If we want to map from left to right::
+
+ u:k:r
+ id - u + k = n
+
+- If we want to map from right to left::
+
+ u:k:r
+ id - k + u = n
+
+Instead of "left to right" we can also say "down" and instead of "right to
+left" we can also say "up". Obviously mapping down and up invert each other.
+
+To see whether the simple formulas above work, consider the following two
+idmappings::
+
+ 1. u0:k20000:r10000
+ 2. u500:k30000:r10000
+
+Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
+want to know what id this was mapped from in the upper idmapset of the first
+idmapping. So we're mapping up in the first idmapping::
+
+ id - k + u = n
+ k21000 - k20000 + u0 = u1000
+
+Now assume we are given the id ``u1100`` in the upper idmapset of the second
+idmapping and we want to know what this id maps down to in the lower idmapset
+of the second idmapping. This means we're mapping down in the second
+idmapping::
+
+ id - u + k = n
+ u1100 - u500 + k30000 = k30600
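+
+The same arithmetic can be modelled in C (a toy sketch; ``struct
+idmap`` is hypothetical and stands in for a single ``u:k:r`` extent,
+and overflow checks are omitted)::
+
+    struct idmap { unsigned int u, k, r; };
+
+    /* Map "down": upper (userspace) id -> lower (kernel) id. */
+    static long map_down(struct idmap m, unsigned int id)
+    {
+            if (id < m.u || id >= m.u + m.r)
+                    return -1;      /* unmapped in this idmapping */
+            return (long)id - m.u + m.k;
+    }
+
+    /* Map "up": lower (kernel) id -> upper (userspace) id. */
+    static long map_up(struct idmap m, unsigned int id)
+    {
+            if (id < m.k || id >= m.k + m.r)
+                    return -1;      /* unmapped in this idmapping */
+            return (long)id - m.k + m.u;
+    }
+
+With the first idmapping above, ``map_up((struct idmap){ 0, 20000,
+10000 }, 21000)`` returns ``1000``, matching the worked example.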
+
+General notes
+-------------
+
+In the context of the kernel an idmapping can be interpreted as mapping a range
+of userspace ids into a range of kernel ids::
+
+ userspace-id:kernel-id:range
+
+A userspace id is always an element in the upper idmapset of an idmapping of
+type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
+idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
+"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
+types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
+
+The kernel is mostly concerned with kernel ids. They are used when performing
+permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
+A userspace id on the other hand is an id that is reported to userspace by the
+kernel, or is passed by userspace to the kernel, or a raw device id that is
+written or read from disk.
+
+Note that we are only concerned with idmappings as the kernel stores them, not
+how userspace would specify them.
+
+For the rest of this document we will prefix all userspace ids with ``u`` and
+all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
+an idmapping will be written as ``u0:k10000:r10000``.
+
+For example, within this idmapping, the id ``u1000`` is an id in the upper
+idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
+``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
+starting with ``k10000``.
+
+A kernel id is always created by an idmapping. Such idmappings are associated
+with user namespaces. Since we mainly care about how idmappings work we're not
+going to be concerned with how idmappings are created nor how they are used
+outside of the filesystem context. This is best left to an explanation of user
+namespaces.
+
+The initial user namespace is special. It always has an idmapping of the
+following form::
+
+ u0:k0:r4294967295
+
+which is an identity idmapping over the full range of ids available on this
+system.
+
+Other user namespaces usually have non-identity idmappings such as::
+
+ u0:k10000:r10000
+
+When a process creates or wants to change ownership of a file, or when the
+ownership of a file is read from disk by a filesystem, the userspace id is
+immediately translated into a kernel id according to the idmapping associated
+with the relevant user namespace.
+
+For instance, consider a file that is stored on disk by a filesystem as being
+owned by ``u1000``:
+
+- If a filesystem were to be mounted in the initial user namespace (as most
+ filesystems are) then the initial idmapping will be used. As we saw this is
+ simply the identity idmapping. This would mean id ``u1000`` read from disk
+ would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
+ would contain ``k1000``.
+
+- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
+ then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
+ ``i_uid`` and ``i_gid`` would contain ``k11000``.
+
+Translation algorithms
+----------------------
+
+We've already seen briefly that it is possible to translate between different
+idmappings. We'll now take a closer look at how that works.
+
+Crossmapping
+~~~~~~~~~~~~
+
+This translation algorithm is used by the kernel in quite a few places. For
+example, it is used when reporting back the ownership of a file to userspace
+via the ``stat()`` system call family.
+
+If we've been given ``k11000`` from one idmapping we can map that id up in
+another idmapping. In order for this to work both idmappings need to contain
+the same kernel id in their kernel idmapsets. For example, consider the
+following idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u20000:k10000:r10000
+
+and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
+then translate ``k11000`` into a userspace id in the second idmapping using the
+kernel idmapset of the second idmapping::
+
+ /* Map the kernel id up into a userspace id in the second idmapping. */
+ from_kuid(u20000:k10000:r10000, k11000) = u21000
+
+Note how we can get back to the kernel id in the first idmapping by inverting
+the algorithm::
+
+ /* Map the userspace id down into a kernel id in the second idmapping. */
+ make_kuid(u20000:k10000:r10000, u21000) = k11000
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+This algorithm allows us to answer the question what userspace id a given
+kernel id corresponds to in a given idmapping. In order to be able to answer
+this question both idmappings need to contain the same kernel id in their
+respective kernel idmapsets.
+
+For example, when the kernel reads a raw userspace id from disk it maps it down
+into a kernel id according to the idmapping associated with the filesystem.
+Let's assume the filesystem was mounted with an idmapping of
+``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
+means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
+the inode's ``i_uid`` and ``i_gid`` field.
+
+When someone in userspace calls ``stat()`` or a related function to get
+ownership information about the file the kernel can't simply map the id back up
+according to the filesystem's idmapping as this would give the wrong owner if
+the caller is using an idmapping.
+
+So the kernel will map the id back up in the idmapping of the caller. Let's
+assume the caller has the somewhat unconventional idmapping
+``u3000:k20000:r10000``. Then ``k21000`` would map back up to ``u4000``.
+Consequently the user would see that this file is owned by ``u4000``.
+
+Remapping
+~~~~~~~~~
+
+It is possible to translate a kernel id from one idmapping to another one via
+the userspace idmapset of the two idmappings. This is equivalent to remapping
+a kernel id.
+
+Let's look at an example. We are given the following two idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r10000
+
+and we are given ``k11000`` in the first idmapping. In order to translate this
+kernel id in the first idmapping into a kernel id in the second idmapping we
+need to perform two steps:
+
+1. Map the kernel id up into a userspace id in the first idmapping::
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+2. Map the userspace id down into a kernel id in the second idmapping::
+
+ /* Map the userspace id down into a kernel id in the second idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+As you can see we used the userspace idmapset in both idmappings to translate
+the kernel id in one idmapping to a kernel id in another idmapping.
+
+This allows us to answer the question of which kernel id we would need to use to
+get the same userspace id in another idmapping. In order to be able to answer
+this question both idmappings need to contain the same userspace id in their
+respective userspace idmapsets.
+
+Note how we can easily get back to the kernel id in the first idmapping by
+inverting the algorithm:
+
+1. Map the kernel id up into a userspace id in the second idmapping::
+
+ /* Map the kernel id up into a userspace id in the second idmapping. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+2. Map the userspace id down into a kernel id in the first idmapping::
+
+ /* Map the userspace id down into a kernel id in the first idmapping. */
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+Another way to look at this translation is to treat it as inverting one
+idmapping and applying another idmapping if both idmappings have the relevant
+userspace id mapped. This will come in handy when working with idmapped mounts.
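+
+As a minimal kernel-style C sketch (the helper name ``remap_kuid()`` is made
+up for illustration), the remapping algorithm chains the two steps above::
+
+ #include <linux/uidgid.h>
+
+ static kuid_t remap_kuid(struct user_namespace *from,
+                          struct user_namespace *to, kuid_t kuid)
+ {
+         /* Map the kernel id up into a userspace id in the first idmapping. */
+         uid_t uid = from_kuid(from, kuid);
+
+         if (uid == (uid_t)-1)
+                 return INVALID_UID;
+
+         /* Map the userspace id down into a kernel id in the second idmapping. */
+         return make_kuid(to, uid);
+ }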
+
+Invalid translations
+~~~~~~~~~~~~~~~~~~~~
+
+It is never valid to use an id in the kernel idmapset of one idmapping as the
+id in the userspace idmapset of another or the same idmapping. While the kernel
+idmapset always indicates an idmapset in the kernel id space the userspace
+idmapset indicates a userspace id. So the following translations are forbidden::
+
+ /* Map the userspace id down into a kernel id in the first idmapping. */
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+ /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
+ make_kuid(u10000:k20000:r10000, k11000) = k21000
+                                 ~~~~~~
+
+and equally wrong::
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+ /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
+ from_kuid(u20000:k0:r10000, u1000) = u21000
+                             ~~~~~
+
+Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
+``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
+conflated. So the two examples above would cause a compilation failure.
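+
+This type safety comes from the fact that kernel ids are wrapped in dedicated
+structure types. Simplified after ``include/linux/uidgid.h`` (treat this as a
+sketch of the idea, not a verbatim copy)::
+
+ typedef struct {
+         uid_t val;
+ } kuid_t;
+
+ typedef struct {
+         gid_t val;
+ } kgid_t;
+
+ /* Passing a plain uid_t where a kuid_t is expected fails to compile. */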
+
+Idmappings when creating filesystem objects
+-------------------------------------------
+
+The concepts of mapping an id down or mapping an id up are expressed in the two
+kernel functions filesystem developers are rather familiar with and which we've
+already used in this document::
+
+ /* Map the userspace id down into a kernel id. */
+ make_kuid(idmapping, uid)
+
+ /* Map the kernel id up into a userspace id. */
+ from_kuid(idmapping, kuid)
+
+We will take an abbreviated look into how idmappings figure into creating
+filesystem objects. For simplicity we will only look at what happens when the
+VFS has already completed path lookup right before it calls into the filesystem
+itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
+called. We will also assume that the directory we're creating filesystem
+objects in is readable and writable for everyone.
+
+When creating a filesystem object the kernel will look at the caller's
+filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
+but they are exclusively used when determining file ownership which is why they
+are called "filesystem ids". They are usually identical to the uid and gid of
+the caller but can differ. We will just assume they are always identical to not
+get lost in too many details.
+
+When the caller enters the kernel two things happen:
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping.
+ (To be precise, the kernel will simply look at the kernel ids stashed in the
+ credentials of the current task but for our education we'll pretend this
+ translation happens just in time.)
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping.
+
+The second step is important as regular filesystems will ultimately need to map
+the kernel id back up into a userspace id when writing to disk.
+So with the second step the kernel guarantees that a valid userspace id can be
+written to disk. If it can't the kernel will refuse the creation request to not
+even remotely risk filesystem corruption.
+
+The astute reader will have realized that this is simply a variation of the
+crossmapping algorithm we mentioned above in a previous section. First, the
+kernel maps the caller's userspace id down into a kernel id according to the
+caller's idmapping and then maps that kernel id up according to the
+filesystem's idmapping.
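+
+A hedged sketch of the second step (the helper name is made up; the kernel's
+actual check is ``fsuidgid_has_mapping()``, which we will see below)::
+
+ #include <linux/cred.h>
+ #include <linux/uidgid.h>
+
+ static bool fs_can_represent_caller(struct user_namespace *fs_userns)
+ {
+         /* The caller's kernel filesystem ids from its credentials. */
+         kuid_t kuid = current_fsuid();
+         kgid_t kgid = current_fsgid();
+
+         /* Both must map up to valid userspace ids in the filesystem's idmapping. */
+         return kuid_has_mapping(fs_userns, kuid) &&
+                kgid_has_mapping(fs_userns, kgid);
+ }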
+
+From an implementation point of view it's worth mentioning how idmappings are
+represented. All idmappings are taken from the corresponding user namespace.
+
+ - caller's idmapping (usually taken from ``current_user_ns()``)
+ - filesystem's idmapping (``sb->s_user_ns``)
+ - mount's idmapping (``mnt_idmap(vfsmnt)``)
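+
+A hedged sketch of how code typically gets hold of these three (``path`` is an
+illustrative local variable)::
+
+ #include <linux/cred.h>
+ #include <linux/fs.h>
+ #include <linux/mount.h>
+
+ static void show_idmappings(const struct path *path)
+ {
+         struct user_namespace *caller_ns = current_user_ns();          /* caller     */
+         struct user_namespace *fs_ns = path->dentry->d_sb->s_user_ns;  /* filesystem */
+         struct mnt_idmap *idmap = mnt_idmap(path->mnt);                /* mount      */
+ }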
+
+Let's see some examples with caller/filesystem idmappings but without mount
+idmappings. This will exhibit some of the problems we can hit. After that we
+will revisit these examples, this time using mount idmappings, to see how they
+solve the problems we observed before.
+
+Example 1
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+
+Both the caller and the filesystem use the identity idmapping:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping.
+
+ For this second step the kernel will call the function
+ ``fsuidgid_has_mapping()`` which ultimately boils down to calling
+ ``from_kuid()``::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+In this example both idmappings are the same so there's nothing exciting going
+on. Ultimately the userspace id that lands on disk will be ``u1000``.
+
+Example 2
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k11000) = u-1
+
+It's immediately clear that while the caller's userspace id could be
+successfully mapped down into kernel ids in the caller's idmapping the kernel
+ids could not be mapped up according to the filesystem's idmapping. So the
+kernel will deny this creation request.
+
+Note that while this example is less common, because most filesystems can't be
+mounted with a non-initial idmapping, it illustrates a general problem as we
+can see in the next examples.
+
+Example 3
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k11000) = u11000
+
+We can see that the translation always succeeds. The userspace id that the
+filesystem will ultimately put to disk will always be identical to the value of
+the kernel id that was created in the caller's idmapping. This has two main
+consequences:
+
+First, we can't allow a caller to ultimately write to disk with another
+userspace id. We could only do this if we were to mount the whole filesystem
+with the caller's or another idmapping. But that solution would be limited to
+a few filesystems and not very flexible, even though this use-case is pretty
+important in containerized workloads.
+
+Second, the caller will usually not be able to create any files or access
+directories that have stricter permissions because none of the filesystem's
+kernel ids map up into valid userspace ids in the caller's idmapping:
+
+1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Map kernel ids up to userspace ids in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k1000) = u-1
+
+Example 4
+~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k1000) = u-1
+
+The crossmapping algorithm fails in this case because the kernel id in the
+filesystem idmapping cannot be mapped up to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Example 5
+~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k21000) = u-1
+
+Again, the crossmapping algorithm fails in this case because the kernel id in
+the filesystem idmapping cannot be mapped to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Note how in the last two examples things would be simple if the caller were
+using the initial idmapping. For a filesystem mounted with the initial
+idmapping the translation would be trivial. So we only consider a filesystem
+with an idmapping of ``u0:k20000:r10000``:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k21000) = u21000
+
+Idmappings on idmapped mounts
+-----------------------------
+
+The examples we've seen in the previous section where the caller's idmapping
+and the filesystem's idmapping are incompatible cause various issues for
+workloads. For a more complex but common example, consider two containers
+started on the host. To completely prevent the two containers from affecting
+each other, an administrator may often use different non-overlapping idmappings
+for the two containers::
+
+ container1 idmapping: u0:k10000:r10000
+ container2 idmapping: u0:k20000:r10000
+ filesystem idmapping: u0:k30000:r10000
+
+An administrator wanting to provide easy read-write access to the following set
+of files::
+
+ dir id: u0
+ dir/file1 id: u1000
+ dir/file2 id: u2000
+
+to both containers currently can't do so.
+
+Of course the administrator has the option to recursively change ownership via
+``chown()``. For example, they could change ownership so that ``dir`` and all
+files below it can be crossmapped from the filesystem's into the container's
+idmapping. Let's assume they change ownership so it is compatible with the
+first container's idmapping::
+
+ dir id: u10000
+ dir/file1 id: u11000
+ dir/file2 id: u12000
+
+This would still leave ``dir`` rather useless to the second container. In fact,
+``dir`` and all files below it would continue to appear owned by the overflowid
+for the second container.
+
+Or consider another increasingly popular example. Some service managers such as
+systemd implement a concept called "portable home directories". A user may want
+to use their home directories on different machines where they are assigned
+different login userspace ids. Most users will have ``u1000`` as the login id
+on their machine at home and all files in their home directory will usually be
+owned by ``u1000``. At uni or at work they may have another login id such as
+``u1125``. This makes it rather difficult to interact with their home directory
+on their work machine.
+
+In both cases changing ownership recursively has grave implications. The most
+obvious one is that ownership is changed globally and permanently. In the home
+directory case this change in ownership would even need to happen every time the
+user switches from their home to their work machine. For really large sets of
+files this becomes increasingly costly.
+
+If the user is lucky, they are dealing with a filesystem that is mountable
+inside user namespaces. But this would also change ownership globally and the
+change in ownership is tied to the lifetime of the filesystem mount, i.e. the
+superblock. The only way to change ownership is to completely unmount the
+filesystem and mount it again in another user namespace. This is usually
+impossible because it would mean that all users currently accessing the
+filesystem would lose access. And it means that ``dir`` still can't be shared
+between two containers with different idmappings.
+But usually the user doesn't even have this option since most filesystems
+aren't mountable inside containers. And not having them mountable might be
+desirable as it doesn't require the filesystem to deal with malicious
+filesystem images.
+
+But the usecases mentioned above and more can be handled by idmapped mounts.
+They make it possible to expose the same set of dentries with different
+ownership at different mounts. This is achieved by marking the mounts with
+a user namespace through the ``mount_setattr()`` system call. The idmapping
+associated with it is then used to translate from the caller's idmapping to
+the filesystem's idmapping and vice versa using the remapping algorithm we
+introduced above.
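+
+As a hedged userspace sketch (the paths and the helper name are made up, and
+error handling is abbreviated; ``open_tree()``, ``mount_setattr()`` and
+``move_mount()`` are the real system calls), creating an idmapped mount might
+look like this::
+
+ #define _GNU_SOURCE
+ #include <errno.h>
+ #include <fcntl.h>
+ #include <linux/mount.h>
+ #include <sys/syscall.h>
+ #include <unistd.h>
+
+ /* Expose /mnt/dir at /mnt/idmapped, idmapped according to the user
+  * namespace behind @userns_fd. */
+ static int make_idmapped_mount(int userns_fd)
+ {
+         struct mount_attr attr = {
+                 .attr_set  = MOUNT_ATTR_IDMAP,
+                 .userns_fd = userns_fd,
+         };
+         int fd_tree;
+
+         /* Create a detached copy of the mount at /mnt/dir. */
+         fd_tree = syscall(SYS_open_tree, -EBADF, "/mnt/dir",
+                           OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+         if (fd_tree < 0)
+                 return -1;
+
+         /* Mark the detached mount with the idmapping ... */
+         if (syscall(SYS_mount_setattr, fd_tree, "", AT_EMPTY_PATH,
+                     &attr, sizeof(attr)) ||
+             /* ... and attach it at /mnt/idmapped. */
+             syscall(SYS_move_mount, fd_tree, "", -EBADF, "/mnt/idmapped",
+                     MOVE_MOUNT_F_EMPTY_PATH)) {
+                 close(fd_tree);
+                 return -1;
+         }
+
+         close(fd_tree);
+         return 0;
+ }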
+
+Idmapped mounts make it possible to change ownership in a temporary and
+localized way. The ownership changes are restricted to a specific mount and the
+ownership changes are tied to the lifetime of the mount. All other users and
+locations where the filesystem is exposed are unaffected.
+
+Filesystems that support idmapped mounts don't have any real reason to support
+being mountable inside user namespaces. A filesystem could be exposed
+completely under an idmapped mount to get the same effect. This has the
+advantage that filesystems can leave the creation of the superblock to
+privileged users in the initial user namespace.
+
+However, it is perfectly possible to combine idmapped mounts with filesystems
+mountable inside user namespaces. We will touch on this further below.
+
+Filesystem types vs idmapped mount types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the introduction of idmapped mounts we need to distinguish between
+filesystem ownership and mount ownership of a VFS object such as an inode. The
+owner of an inode might be different when looked at from a filesystem
+perspective than when looked at from an idmapped mount. Such fundamental
+conceptual distinctions should almost always be clearly expressed in the code.
+So, to distinguish idmapped mount ownership from filesystem ownership separate
+types have been introduced.
+
+If a uid or gid has been generated using the filesystem or caller's idmapping
+then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
+has been generated using a mount idmapping then we will be using the dedicated
+``vfsuid_t`` and ``vfsgid_t`` types.
+
+All VFS helpers that generate or take uids and gids as arguments use the
+``vfsuid_t`` and ``vfsgid_t`` types, so we can rely on the compiler to catch
+errors that originate from conflating filesystem and VFS uids and gids.
+
+The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
+and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
+from and to ``uid_t`` and ``gid_t`` types::
+
+ uid_t <--> kuid_t <--> vfsuid_t
+ gid_t <--> kgid_t <--> vfsgid_t
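+
+The ``vfsuid_t`` and ``vfsgid_t`` wrappers follow the same pattern as
+``kuid_t`` and ``kgid_t``. Simplified sketch after
+``include/linux/mnt_idmapping.h``::
+
+ typedef struct {
+         uid_t val;
+ } vfsuid_t;
+
+ typedef struct {
+         gid_t val;
+ } vfsgid_t;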
+
+Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
+e.g., during ``stat()``, or store ownership information in a shared VFS object
+based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
+use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.
+
+To illustrate why these helpers exist, consider what happens when we change
+ownership of an inode from an idmapped mount. After we generated
+a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit
+that ``vfsuid_t`` or ``vfsgid_t`` as the new filesystem wide ownership. Thus,
+we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t`` or
+``kgid_t``. This is done by using ``vfsuid_into_kuid()`` and
+``vfsgid_into_kgid()``.
+
+Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
+``struct posix_acl``, stores ownership information a filesystem or "global"
+``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
+and ``vfsgid_t`` is specific to an idmapped mount.
+
+We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
+on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
+on filesystem idmappings. To prevent abusing filesystem idmappings to generate
+``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
+or ``kgid_t`` types filesystem idmappings and mount idmappings are different
+types as well.
+
+All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
+a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
+a filesystem or caller idmapping will cause a compilation error.
+
+Similar to how we prefix all userspace ids in this document with ``u`` and all
+kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
+idmapping will be written as: ``u0:v10000:r10000``.
+
+Remapping helpers
+~~~~~~~~~~~~~~~~~
+
+Idmapping functions were added that translate between idmappings. They make use
+of the remapping algorithm we've introduced earlier. We're going to look at:
+
+- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
+
+ The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
+ VFS ids in the mount's idmapping::
+
+ /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(filesystem, kid) = uid
+
+ /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(mount, uid) = vfsuid
+
+- ``mapped_fsuid()`` and ``mapped_fsgid()``
+
+ The ``mapped_fs*id()`` functions translate the caller's kernel ids into
+ kernel ids in the filesystem's idmapping. This translation is achieved by
+ remapping the caller's VFS ids using the mount's idmapping::
+
+ /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(mount, kid) = uid
+
+ /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(filesystem, uid) = kuid
+
+- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``
+
+ The ``vfs*id_into_k*id()`` functions turn a ``vfsuid_t`` or ``vfsgid_t``
+ back into a filesystem wide ``kuid_t`` or ``kgid_t`` whenever ownership is
+ committed to a shared VFS object or reported back to the caller, as
+ described in the previous section.
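+
+A hedged sketch of how ``i_uid_into_vfsuid()`` is essentially composed in the
+kernel (the exact signatures in the tree may differ slightly;
+``make_vfsuid()`` is the underlying remapping helper)::
+
+ static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
+                                          const struct inode *inode)
+ {
+         /* Remap the inode's kernel id from the filesystem's idmapping
+          * (i_user_ns(inode)) into the mount's idmapping. */
+         return make_vfsuid(idmap, i_user_ns(inode), inode->i_uid);
+ }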
+
+Note that ``i_uid_into_vfsuid()`` and ``mapped_fsuid()`` invert each other.
+Consider the following idmappings::
+
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
+to ``k21000`` according to its idmapping. This is what is stored in the
+inode's ``i_uid`` and ``i_gid`` fields.
+
+When the caller queries the ownership of this file via ``stat()`` the kernel
+would usually simply use the crossmapping algorithm and map the filesystem's
+kernel id up to a userspace id in the caller's idmapping.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
+id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k21000):
+ /* Map the filesystem's kernel id up into a userspace id. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+ /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+Finally, when the kernel reports the owner to the caller it will turn the
+VFS id in the mount's idmapping into a userspace id in the caller's
+idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+We can test whether this algorithm really works by verifying what happens when
+we create a new file. Let's say the user is creating a file with ``u1000``.
+
+The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
+kernel would now apply the crossmapping, verifying that ``k11000`` can be
+mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
+be mapped up in the filesystem's idmapping directly this creation request
+fails.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``mapped_fsuid()`` thereby translating the caller's VFS id into
+a kernel id in the filesystem's idmapping via the mount's idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+When finally writing to disk the kernel will then map ``k21000`` up into a
+userspace id in the filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+As we can see, we end up with an invertible and therefore information
+preserving algorithm. A file created from ``u1000`` on an idmapped mount will
+also be reported as being owned by ``u1000`` and vice versa.
+
+Let's now briefly reconsider the failing examples from earlier in the context
+of idmapped mounts.
+
+Example 2 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+When the caller is using a non-initial idmapping the common case is to attach
+the same idmapping to the mount. We now perform three steps:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+So the ownership that lands on disk will be ``u1000``.
+
+Example 3 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u0:v10000:r10000
+
+The same translation algorithm works with the third example.
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+So the ownership that lands on disk will be ``u1000``.
+
+Example 4 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u0:v10000:r10000
+
+In order to report ownership to userspace the kernel now does three steps using
+the translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k1000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+Earlier, the file's kernel id couldn't be crossmapped into the caller's
+idmapping. With the idmapped mount in place it now can be crossmapped into the
+caller's idmapping via the mount's idmapping. The file is now reported as
+being owned by ``u1000`` according to the mount's idmapping.
+
+Example 5 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+Again, in order to report ownership to userspace the kernel now does three
+steps using the translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k21000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+Earlier, the file's kernel id couldn't be crossmapped into the caller's
+idmapping. With the idmapped mount in place it now can be crossmapped into the
+caller's idmapping via the mount's idmapping. The file is now owned by
+``u1000`` according to the mount's idmapping.
+
+Changing ownership on a home directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We've seen above how idmapped mounts can be used to translate between
+idmappings when either the caller, the filesystem, or both use a non-initial
+idmapping. A wide range of usecases exist when the caller is using
+a non-initial idmapping. This mostly happens in the context of containerized
+workloads. As we have seen, the consequence is that access to the filesystem
+doesn't work, whether the filesystem is mounted with the initial idmapping or
+with a non-initial idmapping, because the kernel ids can't be crossmapped
+between the caller's and the filesystem's idmappings.
+
+As we've seen above idmapped mounts provide a solution to this by remapping the
+caller's or filesystem's idmapping according to the mount's idmapping.
+
+Aside from containerized workloads, idmapped mounts have the advantage that
+they also work when both the caller and the filesystem use the initial
+idmapping which means users on the host can change the ownership of directories
+and files on a per-mount basis.
+
+Consider our previous example where a user has their home directory on portable
+storage. At home they have id ``u1000`` and all files in their home directory
+are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
+
+Taking their home directory with them becomes problematic. They can't easily
+access their files, they might not be able to write to disk without applying
+lax permissions or ACLs and even if they can, they will end up with an annoying
+mix of files and directories owned by ``u1000`` and ``u1125``.
+
+Idmapped mounts make it possible to solve this problem. A user can create an
+idmapped mount for their home directory on their work computer or their
+computer at home depending on what ownership they would prefer to end up on
+the portable storage itself.
+
+Let's assume they want all files on disk to belong to ``u1000``. When the user
+plugs in their portable storage at their work station they can setup a job that
+creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
+when they create a file the kernel performs the following steps we already know
+from above::
+
+ caller id: u1125
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u1000:v1125:r1
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1125) = k1125
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v1125):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u1000:v1125:r1, v1125) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+3. Verify that the caller's filesystem ids can be mapped to userspace ids in
+ the filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+So ultimately the file will be created with ``u1000`` on disk.
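+
+As a hedged illustration of the setup itself, the user namespace attached to
+such a mount via ``mount_setattr()`` would carry the minimal idmapping
+``u1000:v1125:r1`` in its ``uid_map``/``gid_map``, encoded in the usual
+"inside-id outside-id count" format as a single line::
+
+ 1000 1125 1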
+
+Now let's briefly look at what ownership the caller with id ``u1125`` will see
+on their work computer:
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u1000:v1125:r1
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k1000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u1000:v1125:r1, u1000) = v1125
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k1125 = vfsuid_into_kuid(v1125)
+ from_kuid(u0:k0:r4294967295, k1125) = u1125
+
+So ultimately the caller will see the file as belonging to ``u1125`` which is
+the caller's userspace id on their workstation in our example.
+
+The raw userspace id that is put on disk is ``u1000``. So when the user takes
+their home directory back to their home computer, where they are assigned
+``u1000`` under the initial idmapping, and mounts the filesystem with the
+initial idmapping, they will see all those files owned by ``u1000``.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index e7b46dac7079..e18bc5ae3b35 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -24,6 +24,20 @@ algorithms work.
splice
locking
directory-locking
+ devpts
+ dnotify
+ fiemap
+ files
+ locks
+ mount_api
+ quota
+ seq_file
+ sharedsubtree
+ idmappings
+
+ automount-support
+
+ caching/index
porting
@@ -39,6 +53,7 @@ filesystem implementations.
journalling
fscrypt
fsverity
+ netfs_library
Filesystems
===========
@@ -58,7 +73,10 @@ Documentation for filesystem implementations.
bfs
btrfs
ceph
+ coda
+ configfs
cramfs
+ dax
debugfs
dlmfs
ecryptfs
@@ -66,18 +84,22 @@ Documentation for filesystem implementations.
erofs
ext2
ext3
+ ext4/index
f2fs
gfs2
gfs2-uevents
+ gfs2-glocks
hfs
hfsplus
hpfs
fuse
+ fuse-io
inotify
isofs
nilfs2
nfs/index
ntfs
+ ntfs3
ocfs2
ocfs2-online-filecheck
omfs
@@ -88,13 +110,16 @@ Documentation for filesystem implementations.
ramfs-rootfs-initramfs
relay
romfs
+ smb/index
+ spufs/index
squashfs
sysfs
sysv-fs
tmpfs
ubifs
- ubifs-authentication.rst
+ ubifs-authentication
udf
virtiofs
vfat
+ xfs/index
zonefs
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..e18f90ffc6fd 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -10,27 +10,27 @@ Details
The journalling layer is easy to use. You need to first of all create a
journal_t data structure. There are two calls to do this dependent on
how you decide to allocate the physical media on which the journal
-resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
-filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
+resides. The jbd2_journal_init_inode() call is for journals stored in
+filesystem inodes, or the jbd2_journal_init_dev() call can be used
for journal stored on a raw device (in a continuous range of blocks). A
journal_t is a typedef for a struct pointer, so when you are finally
-finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
+finished make sure you call jbd2_journal_destroy() on it to free up
any used kernel memory.
Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
was already allocated and initialized properly by the userspace tools.
-When loading the journal you must call :c:func:`jbd2_journal_load` to process
+When loading the journal you must call jbd2_journal_load() to process
journal contents. If the client file system detects the journal contents
does not need to be processed (or even need not have valid contents), it
-may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
-calling :c:func:`jbd2_journal_load`.
+may call jbd2_journal_wipe() to clear the journal contents before
+calling jbd2_journal_load().
Note that jbd2_journal_wipe(..,0) calls
-:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
-transactions in the journal and similarly :c:func:`jbd2_journal_load` will
-call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
-:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
+jbd2_journal_skip_recovery() for you if it detects any outstanding
+transactions in the journal and similarly jbd2_journal_load() will
+call jbd2_journal_recover() if necessary. I would advise reading
+ext4_load_journal() in fs/ext4/super.c for examples on this stage.
Now you can go ahead and start modifying the underlying filesystem.
Almost.
@@ -39,57 +39,57 @@ You still need to actually journal your filesystem changes, this is done
by wrapping them into transactions. Additionally you also need to wrap
the modification of each of the buffers with calls to the journal layer,
so it knows what the modifications you are actually making are. To do
-this use :c:func:`jbd2_journal_start` which returns a transaction handle.
+this use jbd2_journal_start() which returns a transaction handle.
-:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
+jbd2_journal_start() and its counterpart jbd2_journal_stop(),
which indicates the end of a transaction are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
-:c:func:`jbd2_journal_stop` the same number of times as
-:c:func:`jbd2_journal_start` before the transaction is completed (or more
+jbd2_journal_stop() the same number of times as
+jbd2_journal_start() before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
-need to call :c:func:`jbd2_journal_get_create_access()` /
-:c:func:`jbd2_journal_get_write_access()` /
-:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
+need to call jbd2_journal_get_create_access() /
+jbd2_journal_get_write_access() /
+jbd2_journal_get_undo_access() as appropriate, this allows the
journalling layer to copy the unmodified
data if it needs to. After all the buffer may be part of a previously
uncommitted transaction. At this point you are at last ready to modify a
buffer, and once you are have done so you need to call
-:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+jbd2_journal_dirty_metadata(). Or if you've asked for access to a
buffer you now know is now longer required to be pushed back on the
-device you can call :c:func:`jbd2_journal_forget` in much the same way as you
-might have used :c:func:`bforget` in the past.
+device you can call jbd2_journal_forget() in much the same way as you
+might have used bforget() in the past.
-A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.
-Then at umount time , in your :c:func:`put_super` you can then call
-:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+Then at umount time, in your put_super() you can then call
+jbd2_journal_destroy() to clean up your in-core journal object.
Unfortunately there a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time, remember nothing commits
-until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+until the outermost jbd2_journal_stop(). This means you must complete
the transaction at the end of each file/inode/address etc. operation you
perform, so that the journalling system isn't re-entered on another
journal. Since transactions can't be nested/batched across differing
journals, and another filesystem other than yours (say ext4) may be
modified in a later syscall.
-The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param) - when it blocks it merely(!) needs to wait
for transactions to complete and be committed from other tasks, so
-essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
-deadlocks you must treat :c:func:`jbd2_journal_start` /
-:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+essentially we are waiting for jbd2_journal_stop(). So to avoid
+deadlocks you must treat jbd2_journal_start() /
+jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules to prevent
-deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
-behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
-easily as on :c:func:`jbd2_journal_start`.
+deadlocks. Note that jbd2_journal_extend() has similar blocking
+behaviour to jbd2_journal_start() so you can deadlock here just as
+easily as on jbd2_journal_start().
Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
@@ -116,8 +116,8 @@ called after each transaction commit. You can also use
that need processing when the transaction commits.
JBD2 also provides a way to block all transaction updates via
-:c:func:`jbd2_journal_lock_updates()` /
-:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+jbd2_journal_lock_updates() /
+jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.
::
@@ -132,6 +132,37 @@ The opportunities for abuse and DOS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.
+Fast commits
+~~~~~~~~~~~~
+
+JBD2 also allows you to perform file-system specific delta commits known as
+fast commits. In order to use fast commits, you will need to set the following
+callbacks that perform the corresponding work:
+
+`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
+fast commit.
+
+`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
+blocks.
+
+The file system is free to perform fast commits as and when it wants as long
+as it gets permission from JBD2 to do so by calling the function
+:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
+file system should tell JBD2 about it by calling
+:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
+commit immediately after stopping the fast commit it can do so by calling
+:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
+fails for some reason and the only way to guarantee consistency is for JBD2 to
+perform the full traditional commit.
+
+JBD2 provides helper functions to manage fast commit buffers. The file system
+can use :c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to
+allocate and wait on IO completion of fast commit buffers.
+
+Currently, only Ext4 implements fast commits. For details of its implementation
+of fast commits, please refer to the top level comments in
+fs/ext4/fast_commit.c.
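+
+As a hedged sketch (the surrounding file system logic is invented for
+illustration; only the ``jbd2_fc_*()`` helpers named above are real), a fast
+commit cycle might look like::
+
+ static int fs_do_fast_commit(journal_t *journal, tid_t tid)
+ {
+         struct buffer_head *bh;
+         int ret;
+
+         /* Ask JBD2 for permission to start a fast commit for @tid. */
+         ret = jbd2_fc_begin_commit(journal, tid);
+         if (ret)
+                 return ret;
+
+         /* Get a fast commit buffer. */
+         ret = jbd2_fc_get_buf(journal, &bh);
+         if (ret)
+                 goto fallback;
+         /* ... write the file system's delta into bh->b_data and submit it ... */
+
+         /* Wait for IO completion of the submitted fast commit buffers. */
+         ret = jbd2_fc_wait_bufs(journal, 1);
+         if (ret)
+                 goto fallback;
+
+         return jbd2_fc_end_commit(journal);
+
+ fallback:
+         /* Fall back to a full commit to guarantee consistency. */
+         return jbd2_fc_end_commit_fallback(journal);
+ }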
+
Summary
~~~~~~~
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..d5bf4b6b7509 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -56,37 +56,43 @@ inode_operations
prototypes::
- int (*create) (struct inode *,struct dentry *,umode_t, bool);
+ int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
- int (*symlink) (struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct inode *,struct dentry *,umode_t);
+ int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
- int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
- int (*rename) (struct inode *, struct dentry *,
+ int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
void (*truncate) (struct inode *);
- int (*permission) (struct inode *, int, unsigned int);
- int (*get_acl)(struct inode *, int);
- int (*setattr) (struct dentry *, struct iattr *);
- int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
+ int (*permission) (struct mnt_idmap *, struct inode *, int, unsigned int);
+ struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
+ int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
+ int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
- int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+ int (*tmpfile) (struct mnt_idmap *, struct inode *,
+ struct file *, umode_t);
+ int (*fileattr_set)(struct mnt_idmap *idmap,
+ struct dentry *dentry, struct fileattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
locking rules:
all may block
-============ =============================================
+============== ==================================================
ops i_rwsem(inode)
-============ =============================================
+============== ==================================================
lookup: shared
create: exclusive
link: exclusive (both)
@@ -95,11 +101,12 @@ symlink: exclusive
mkdir: exclusive
unlink: exclusive (both)
rmdir: exclusive (both)(see below)
-rename: exclusive (all) (see below)
+rename: exclusive (both parents, some children) (see below)
readlink: no
get_link: no
setattr: exclusive
permission: no (may not block if called in rcu-walk mode)
+get_inode_acl: no
get_acl: no
getattr: no
listxattr: no
@@ -107,12 +114,18 @@ fiemap: no
update_time: no
atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
-============ =============================================
+fileattr_get: no or exclusive
+fileattr_set: exclusive
+get_offset_ctx no
+============== ==================================================
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
+ involved.
+ ->rename() has ->i_rwsem exclusive on any subdirectory that changes parent.
See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.
@@ -126,9 +139,10 @@ prototypes::
int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
struct inode *inode, const char *name, void *buffer,
size_t size);
- int (*set)(const struct xattr_handler *handler, struct dentry *dentry,
- struct inode *inode, const char *name, const void *buffer,
- size_t size, int flags);
+ int (*set)(const struct xattr_handler *handler,
+ struct mnt_idmap *idmap,
+ struct dentry *dentry, struct inode *inode, const char *name,
+ const void *buffer, size_t size, int flags);
locking rules:
all may block
@@ -163,7 +177,6 @@ prototypes::
int (*show_options)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
- int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
locking rules:
All may block [not true, see below]
@@ -188,7 +201,6 @@ umount_begin: no
show_options: no (namespace_sem)
quota_read: no (see below)
quota_write: no (see below)
-bdev_try_to_free_page: no (see below)
====================== ============ ========================
->statfs() has s_umount (shared) when called by ustat(2) (native or
@@ -204,9 +216,6 @@ dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.
-->bdev_try_to_free_page is called from the ->releasepage handler of
-the block device inode. See there for more details.
-
file_system_type
================
@@ -236,67 +245,64 @@ address_space_operations
prototypes::
int (*writepage)(struct page *page, struct writeback_control *wbc);
- int (*readpage)(struct file *, struct page *);
+ int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
- int (*set_page_dirty)(struct page *page);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
+ bool (*dirty_folio)(struct address_space *, struct folio *folio);
+ void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
+ loff_t pos, unsigned len,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
- void (*invalidatepage) (struct page *, unsigned int, unsigned int);
- int (*releasepage) (struct page *, int);
- void (*freepage)(struct page *);
+ void (*invalidate_folio) (struct folio *, size_t start, size_t len);
+ bool (*release_folio)(struct folio *, gfp_t);
+ void (*free_folio)(struct folio *);
int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- bool (*isolate_page) (struct page *, isolate_mode_t);
- int (*migratepage)(struct address_space *, struct page *, struct page *);
- void (*putback_page) (struct page *);
- int (*launder_page)(struct page *);
- int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
- int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
+ int (*launder_folio)(struct folio *);
+ bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
+ int (*error_remove_folio)(struct address_space *, struct folio *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
locking rules:
- All except set_page_dirty and freepage may block
+ All except dirty_folio and free_folio may block
-====================== ======================== =========
-ops PageLocked(page) i_rwsem
-====================== ======================== =========
+====================== ======================== ========= ===============
+ops folio locked i_rwsem invalidate_lock
+====================== ======================== ========= ===============
writepage: yes, unlocks (see below)
-readpage: yes, unlocks
+read_folio: yes, unlocks shared
writepages:
-set_page_dirty no
-readpages:
+dirty_folio: maybe
+readahead: yes, unlocks shared
write_begin: locks the page exclusive
write_end: yes, unlocks exclusive
bmap:
-invalidatepage: yes
-releasepage: yes
-freepage: yes
+invalidate_folio: yes exclusive
+release_folio: yes
+free_folio: yes
direct_IO:
-isolate_page: yes
-migratepage: yes (both)
-putback_page: yes
-launder_page: yes
+migrate_folio: yes (both)
+launder_folio: yes
is_partially_uptodate: yes
-error_remove_page: yes
+error_remove_folio: yes
swap_activate: no
swap_deactivate: no
-====================== ======================== =========
+swap_rw: yes, unlocks
+====================== ======================== ========= ===============
-->write_begin(), ->write_end() and ->readpage() may be called from
+->write_begin(), ->write_end() and ->read_folio() may be called from
the request handler (/dev/loop).
-->readpage() unlocks the page, either synchronously or via I/O
+->read_folio() unlocks the folio, either synchronously or via I/O
completion.
-->readpages() populates the pagecache with the passed pages and starts
-I/O against them. They come unlocked upon I/O completion.
+->readahead() unlocks the folios that I/O is attempted on like ->read_folio().
->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
@@ -356,43 +362,57 @@ If nr_to_write is NULL, all dirty pages must be written.
writepages should _only_ write pages which are present on
mapping->io_pages.
-->set_page_dirty() is called from various places in the kernel
-when the target page is marked as needing writeback. It may be called
-under spinlock (it cannot block) and is sometimes called with the page
-not locked.
+->dirty_folio() is called from various places in the kernel when
+the target folio is marked as needing writeback. The folio cannot be
+truncated because either the caller holds the folio lock, or the caller
+has found the folio while holding the page table lock which will block
+truncation.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.
-->invalidatepage() is called when the filesystem must attempt to drop
+->invalidate_folio() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
-returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
-
-->releasepage() is called when the kernel is about to try to drop the
-buffers from the page in preparation for freeing it. It returns zero to
-indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
-the kernel assumes that the fs has no private interest in the buffers.
-
-->freepage() is called when the kernel is done dropping the page
+returns zero on success. The filesystem must exclusively acquire
+invalidate_lock before invalidating page cache in truncate / hole punch
+path (and thus calling into ->invalidate_folio) to block races between page
+cache invalidation and page cache filling functions (fault, read, ...).
+
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data. For example,
+it may be about to be removed from the address_space or split. The folio
+is locked and not under writeback. It may be dirty. The gfp parameter
+is not usually used for allocation, but rather to indicate what the
+filesystem may do to attempt to free the private data. The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio. If a filesystem does not provide a ->release_folio method,
+the pagecache will assume that private data is buffer_heads and call
+try_to_free_buffers().
+
+->free_folio() is called when the kernel has dropped the folio
from the page cache.
-->launder_page() may be called prior to releasing a page if
-it is still found to be dirty. It returns zero if the page was successfully
-cleaned, or an error value if not. Note that in order to prevent the page
+->launder_folio() may be called prior to releasing a folio if
+it is still found to be dirty. It returns zero if the folio was successfully
+cleaned, or an error value if not. Note that in order to prevent the folio
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap. It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation. It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added. If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.
->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.
+->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
+
file_lock_operations
====================
@@ -425,17 +445,23 @@ prototypes::
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
int (*lm_change)(struct file_lock **, int);
+ bool (*lm_breaker_owns_lease)(struct file_lock *);
+ bool (*lm_lock_expirable)(struct file_lock *);
+ void (*lm_expire_lock)(void);
locking rules:
-========== ============= ================= =========
-ops inode->i_lock blocked_lock_lock may block
-========== ============= ================= =========
-lm_notify: yes yes no
+====================== ============= ================= =========
+ops flc_lock blocked_lock_lock may block
+====================== ============= ================= =========
+lm_notify: no yes no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no
-========== ============= ================= =========
+lm_breaker_owns_lease: yes no no
+lm_lock_expirable yes no no
+lm_expire_lock no no yes
+====================== ============= ================= =========
buffer_head
===========
@@ -461,32 +487,25 @@ prototypes::
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t, void **,
unsigned long *);
- int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
- int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules:
======================= ===================
-ops bd_mutex
+ops open_mutex
======================= ===================
open: yes
release: yes
ioctl: no
compat_ioctl: no
direct_access: no
-media_changed: no
unlock_native_capacity: no
-revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
======================= ===================
-media_changed, unlock_native_capacity and revalidate_disk are called only from
-check_disk_change().
-
swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.
@@ -501,7 +520,7 @@ prototypes::
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
- int (*iterate) (struct file *, struct dir_context *);
+ int (*iopoll) (struct kiocb *kiocb, bool spin);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -513,14 +532,6 @@ prototypes::
int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
- loff_t *);
- ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
- loff_t *);
- ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
- void __user *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
- loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
@@ -531,6 +542,14 @@ prototypes::
size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *, int, loff_t, loff_t);
+ void (*show_fdinfo)(struct seq_file *m, struct file *f);
+ unsigned (*mmap_capabilities)(struct file *);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
+ loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t len, unsigned int remap_flags);
+ int (*fadvise)(struct file *, loff_t, loff_t, int);
locking rules:
All may block.
@@ -543,9 +562,8 @@ mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something that userspace has to take care of.
-->iterate() is called with i_rwsem exclusive.
-
-->iterate_shared() is called with i_rwsem at least shared.
+->iterate_shared() is called with i_rwsem held for reading, and with the
+file's f_pos_lock held exclusively.
->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
@@ -565,6 +583,25 @@ in sys_read() and friends.
the lease within the individual filesystem to record the result of the
operation
+->fallocate implementations must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate the relevant range of the page
+cache. However, the filesystem usually also needs to update its internal (and
+on-disk) view of the file offset -> disk block mapping. Until this update is
+finished, the filesystem must block page faults and reads from reloading
+now-stale page cache contents from the disk. Since the VFS acquires
+mapping->invalidate_lock in shared mode when loading pages from disk
+(filemap_fault(), filemap_read(), readahead paths), the fallocate
+implementation must take the invalidate_lock to prevent reloading.
+
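+The resulting pattern for e.g. a hole punch looks roughly like this (a sketch;
+the mapping update in the middle is filesystem specific)::
+
+	filemap_invalidate_lock(inode->i_mapping);
+	truncate_inode_pages_range(inode->i_mapping, start, end);
+	/* ... update the file offset -> disk block mapping ... */
+	filemap_invalidate_unlock(inode->i_mapping);
+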
+->copy_file_range and ->remap_file_range implementations need to serialize
+against modifications of file data while the operation is running. For
+blocking changes through write(2) and similar operations, inode->i_rwsem can
+be used. To block changes to file contents via a memory mapping during the
+operation, the filesystem must take mapping->invalidate_lock to coordinate
+with ->page_mkwrite.
+
dquot_operations
================
@@ -601,50 +638,62 @@ vm_operations_struct
prototypes::
- void (*open)(struct vm_area_struct*);
- void (*close)(struct vm_area_struct*);
- vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
+ void (*open)(struct vm_area_struct *);
+ void (*close)(struct vm_area_struct *);
+ vm_fault_t (*fault)(struct vm_fault *);
+ vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
+ vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
	vm_fault_t (*page_mkwrite)(struct vm_fault *);
	vm_fault_t (*pfn_mkwrite)(struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
-============= ======== ===========================
-ops mmap_sem PageLocked(page)
-============= ======== ===========================
-open: yes
-close: yes
-fault: yes can return with page locked
-map_pages: yes
-page_mkwrite: yes can return with page locked
-pfn_mkwrite: yes
-access: yes
-============= ======== ===========================
-
-->fault() is called when a previously not present pte is about
-to be faulted in. The filesystem must find and return the page associated
-with the passed in "pgoff" in the vm_fault structure. If it is possible that
-the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
+============= ========== ===========================
+ops mmap_lock PageLocked(page)
+============= ========== ===========================
+open: write
+close: read/write
+fault: read can return with page locked
+huge_fault: maybe-read
+map_pages: maybe-read
+page_mkwrite: read can return with page locked
+pfn_mkwrite: read
+access: read
+============= ========== ===========================
+
+->fault() is called when a previously not present pte is about to be faulted
+in. The filesystem must find and return the page associated with the passed in
+"pgoff" in the vm_fault structure. If it is possible that the page may be
+truncated and/or invalidated, then the filesystem must lock invalidate_lock,
+then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
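+
+Many filesystems that use the page cache point ->fault at filemap_fault(),
+which implements these rules; a sketch of a vm_operations table doing so
+(``my_page_mkwrite`` is a hypothetical filesystem method)::
+
+	static const struct vm_operations_struct my_vm_ops = {
+		.fault		= filemap_fault,
+		.map_pages	= filemap_map_pages,
+		.page_mkwrite	= my_page_mkwrite,
+	};
+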
+->huge_fault() is called when there is no PUD or PMD entry present. This
+gives the filesystem the opportunity to install a PUD or PMD sized page.
+Filesystems can also use the ->fault method to return a PMD sized page,
+so implementing this function may not be necessary. In particular,
+filesystems should not call filemap_fault() from ->huge_fault().
+The mmap_lock may not be held when this method is called.
+
->map_pages() is called when the VM asks to map easily accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
-till "end_pgoff". ->map_pages() is called with page table locked and must
+till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
-filesystem should skip it. Filesystem should use do_set_pte() to setup
+filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".
-->page_mkwrite() is called when a previously read-only pte is
-about to become writeable. The filesystem again must ensure that there are
-no truncate/invalidate races, and then return with the page locked. If
-the page has been truncated, the filesystem should not look up a new page
-like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
-will cause the VM to retry the fault.
+->page_mkwrite() is called when a previously read-only pte is about to become
+writeable. The filesystem again must ensure that there are no
+truncate/invalidate races or races with operations such as ->remap_file_range
+or ->copy_file_range, and then return with the page locked. Usually
+mapping->invalidate_lock is suitable for proper serialization. If the page has
+been truncated, the filesystem should not look up a new page like the ->fault()
+handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
+retry the fault.
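+
+A sketch of the usual ->page_mkwrite() pattern (hypothetical filesystem;
+writeback accounting and error paths elided)::
+
+	static vm_fault_t my_page_mkwrite(struct vm_fault *vmf)
+	{
+		struct folio *folio = page_folio(vmf->page);
+		struct inode *inode = file_inode(vmf->vma->vm_file);
+
+		filemap_invalidate_lock_shared(inode->i_mapping);
+		folio_lock(folio);
+		if (folio->mapping != inode->i_mapping) {
+			/* Truncated under us; the VM will retry the fault. */
+			folio_unlock(folio);
+			filemap_invalidate_unlock_shared(inode->i_mapping);
+			return VM_FAULT_NOPAGE;
+		}
+		folio_mark_dirty(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return VM_FAULT_LOCKED;	/* folio stays locked */
+	}
+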
->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.rst
index 5368690f412e..26429317dbbc 100644
--- a/Documentation/filesystems/locks.txt
+++ b/Documentation/filesystems/locks.rst
@@ -1,4 +1,8 @@
- File Locking Release Notes
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+File Locking Release Notes
+==========================
Andy Walker <andy@lysaker.kvaerner.no>
@@ -6,7 +10,7 @@
1. What's New?
---------------
+==============
1.1 Broken Flock Emulation
--------------------------
@@ -25,7 +29,7 @@ anyway (see the file "Documentation/process/changes.rst".)
---------------------------
1.2.1 Typical Problems - Sendmail
----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because sendmail was unable to use the old flock() emulation, many sendmail
installations use fcntl() instead of flock(). This is true of Slackware 3.0
for example. This gave rise to some other subtle problems if sendmail was
@@ -37,7 +41,7 @@ to lock solid with deadlocked processes.
1.2.2 The Solution
-------------------
+^^^^^^^^^^^^^^^^^^
The solution I have chosen, after much experimentation and discussion,
is to make flock() and fcntl() locks oblivious to each other. Both can
exist, and neither will have any effect on the other.
@@ -53,16 +57,9 @@ fcntl(), with all the problems that implies.
1.3 Mandatory Locking As A Mount Option
---------------------------------------
-Mandatory locking, as described in
-'Documentation/filesystems/mandatory-locking.txt' was prior to this release a
-general configuration option that was valid for all mounted filesystems. This
-had a number of inherent dangers, not the least of which was the ability to
-freeze an NFS server by asking it to read a file for which a mandatory lock
-existed.
-
-From this release of the kernel, mandatory locking can be turned on and off
-on a per-filesystem basis, using the mount options 'mand' and 'nomand'.
-The default is to disallow mandatory locking. The intention is that
-mandatory locking only be enabled on a local filesystem as the specific need
-arises.
+Mandatory locking was, prior to this release, a general configuration option
+that was valid for all mounted filesystems. This had a number of inherent
+dangers, not the least of which was the ability to freeze an NFS server by
+asking it to read a file for which a mandatory lock existed.
+This option was dropped in kernel v5.14.
diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.txt
deleted file mode 100644
index a251ca33164a..000000000000
--- a/Documentation/filesystems/mandatory-locking.txt
+++ /dev/null
@@ -1,181 +0,0 @@
- Mandatory File Locking For The Linux Operating System
-
- Andy Walker <andy@lysaker.kvaerner.no>
-
- 15 April 1996
- (Updated September 2007)
-
-0. Why you should avoid mandatory locking
------------------------------------------
-
-The Linux implementation is prey to a number of difficult-to-fix race
-conditions which in practice make it not dependable:
-
- - The write system call checks for a mandatory lock only once
- at its start. It is therefore possible for a lock request to
- be granted after this check but before the data is modified.
- A process may then see file data change even while a mandatory
- lock was held.
- - Similarly, an exclusive lock may be granted on a file after
- the kernel has decided to proceed with a read, but before the
- read has actually completed, and the reading process may see
- the file data in a state which should not have been visible
- to it.
- - Similar races make the claimed mutual exclusion between lock
- and mmap similarly unreliable.
-
-1. What is mandatory locking?
-------------------------------
-
-Mandatory locking is kernel enforced file locking, as opposed to the more usual
-cooperative file locking used to guarantee sequential access to files among
-processes. File locks are applied using the flock() and fcntl() system calls
-(and the lockf() library routine which is a wrapper around fcntl().) It is
-normally a process' responsibility to check for locks on a file it wishes to
-update, before applying its own lock, updating the file and unlocking it again.
-The most commonly used example of this (and in the case of sendmail, the most
-troublesome) is access to a user's mailbox. The mail user agent and the mail
-transfer agent must guard against updating the mailbox at the same time, and
-prevent reading the mailbox while it is being updated.
-
-In a perfect world all processes would use and honour a cooperative, or
-"advisory" locking scheme. However, the world isn't perfect, and there's
-a lot of poorly written code out there.
-
-In trying to address this problem, the designers of System V UNIX came up
-with a "mandatory" locking scheme, whereby the operating system kernel would
-block attempts by a process to write to a file that another process holds a
-"read" -or- "shared" lock on, and block attempts to both read and write to a
-file that a process holds a "write " -or- "exclusive" lock on.
-
-The System V mandatory locking scheme was intended to have as little impact as
-possible on existing user code. The scheme is based on marking individual files
-as candidates for mandatory locking, and using the existing fcntl()/lockf()
-interface for applying locks just as if they were normal, advisory locks.
-
-Note 1: In saying "file" in the paragraphs above I am actually not telling
-the whole truth. System V locking is based on fcntl(). The granularity of
-fcntl() is such that it allows the locking of byte ranges in files, in addition
-to entire files, so the mandatory locking rules also have byte level
-granularity.
-
-Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
-borrowing the fcntl() locking scheme from System V. The mandatory locking
-scheme is defined by the System V Interface Definition (SVID) Version 3.
-
-2. Marking a file for mandatory locking
----------------------------------------
-
-A file is marked as a candidate for mandatory locking by setting the group-id
-bit in its file mode but removing the group-execute bit. This is an otherwise
-meaningless combination, and was chosen by the System V implementors so as not
-to break existing user programs.
-
-Note that the group-id bit is usually automatically cleared by the kernel when
-a setgid file is written to. This is a security measure. The kernel has been
-modified to recognize the special case of a mandatory lock candidate and to
-refrain from clearing this bit. Similarly the kernel has been modified not
-to run mandatory lock candidates with setgid privileges.
-
-3. Available implementations
-----------------------------
-
-I have considered the implementations of mandatory locking available with
-SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
-
-Generally I have tried to make the most sense out of the behaviour exhibited
-by these three reference systems. There are many anomalies.
-
-All the reference systems reject all calls to open() for a file on which
-another process has outstanding mandatory locks. This is in direct
-contravention of SVID 3, which states that only calls to open() with the
-O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
-definition, which is the "Right Thing", since only calls with O_TRUNC can
-modify the contents of the file.
-
-HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
-just mandatory locks. That would appear to contravene POSIX.1.
-
-mmap() is another interesting case. All the operating systems mentioned
-prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
-also disallows advisory locks for such a file. SVID actually specifies the
-paranoid HP-UX behaviour.
-
-In my opinion only MAP_SHARED mappings should be immune from locking, and then
-only from mandatory locks - that is what is currently implemented.
-
-SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
-mandatory locks, so reads and writes to locked files always block when they
-should return EAGAIN.
-
-I'm afraid that this is such an esoteric area that the semantics described
-below are just as valid as any others, so long as the main points seem to
-agree.
-
-4. Semantics
-------------
-
-1. Mandatory locks can only be applied via the fcntl()/lockf() locking
- interface - in other words the System V/POSIX interface. BSD style
- locks using flock() never result in a mandatory lock.
-
-2. If a process has locked a region of a file with a mandatory read lock, then
- other processes are permitted to read from that region. If any of these
- processes attempts to write to the region it will block until the lock is
- released, unless the process has opened the file with the O_NONBLOCK
- flag in which case the system call will return immediately with the error
- status EAGAIN.
-
-3. If a process has locked a region of a file with a mandatory write lock, all
- attempts to read or write to that region block until the lock is released,
- unless a process has opened the file with the O_NONBLOCK flag in which case
- the system call will return immediately with the error status EAGAIN.
-
-4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has
- any mandatory locks owned by other processes will be rejected with the
- error status EAGAIN.
-
-5. Attempts to apply a mandatory lock to a file that is memory mapped and
- shared (via mmap() with MAP_SHARED) will be rejected with the error status
- EAGAIN.
-
-6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
- that has any mandatory locks in effect will be rejected with the error status
- EAGAIN.
-
-5. Which system calls are affected?
------------------------------------
-
-Those which modify a file's contents, not just the inode. That gives read(),
-write(), readv(), writev(), open(), creat(), mmap(), truncate() and
-ftruncate(). truncate() and ftruncate() are considered to be "write" actions
-for the purposes of mandatory locking.
-
-The affected region is usually defined as stretching from the current position
-for the total number of bytes read or written. For the truncate calls it is
-defined as the bytes of a file removed or added (we must also consider bytes
-added, as a lock can specify just "the whole file", rather than a specific
-range of bytes.)
-
-Note 3: I may have overlooked some system calls that need mandatory lock
-checking in my eagerness to get this code out the door. Please let me know, or
-better still fix the system calls yourself and submit a patch to me or Linus.
-
-6. Warning!
------------
-
-Not even root can override a mandatory lock, so runaway processes can wreak
-havoc if they lock crucial files. The way around it is to change the file
-permissions (remove the setgid bit) before trying to read or write to it.
-Of course, that might be a bit tricky if the system is hung :-(
-
-7. The "mand" mount option
---------------------------
-Mandatory locking is disabled on all filesystems by default, and must be
-administratively enabled by mounting with "-o mand". That mount option
-is only allowed if the mounting task has the CAP_SYS_ADMIN capability.
-
-Since kernel v4.5, it is possible to disable mandatory locking
-altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
-with this disabled will reject attempts to mount filesystems with the
-"mand" mount option with the error status EPERM.
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.rst
index 87c14bbb2b35..9aaf6ef75eb5 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.rst
@@ -1,8 +1,10 @@
- ====================
- FILESYSTEM MOUNT API
- ====================
+.. SPDX-License-Identifier: GPL-2.0
-CONTENTS
+====================
+Filesystem Mount API
+====================
+
+.. CONTENTS
(1) Overview.
@@ -21,8 +23,7 @@ CONTENTS
(8) Parameter helper functions.
-========
-OVERVIEW
+Overview
========
The creation of new mounts is now to be done in a multistep process:
@@ -43,7 +44,7 @@ The creation of new mounts is now to be done in a multistep process:
(7) Destroy the context.
-To support this, the file_system_type struct gains two new fields:
+To support this, the file_system_type struct gains two new fields::
int (*init_fs_context)(struct fs_context *fc);
const struct fs_parameter_description *parameters;
@@ -57,12 +58,11 @@ Note that security initialisation is done *after* the filesystem is called so
that the namespaces may be adjusted first.
-======================
-THE FILESYSTEM CONTEXT
+The Filesystem Context
======================
The creation and reconfiguration of a superblock is governed by a filesystem
-context. This is represented by the fs_context structure:
+context. This is represented by the fs_context structure::
struct fs_context {
const struct fs_context_operations *ops;
@@ -79,85 +79,112 @@ context. This is represented by the fs_context structure:
unsigned int sb_flags;
unsigned int sb_flags_mask;
unsigned int s_iflags;
- unsigned int lsm_flags;
enum fs_context_purpose purpose:8;
...
};
The fs_context fields are as follows:
- (*) const struct fs_context_operations *ops
+ * ::
+
+ const struct fs_context_operations *ops
These are operations that can be done on a filesystem context (see
below). This must be set by the ->init_fs_context() file_system_type
operation.
- (*) struct file_system_type *fs_type
+ * ::
+
+ struct file_system_type *fs_type
A pointer to the file_system_type of the filesystem that is being
constructed or reconfigured. This retains a reference on the type owner.
- (*) void *fs_private
+ * ::
+
+ void *fs_private
A pointer to the file system's private data. This is where the filesystem
will need to store any options it parses.
- (*) struct dentry *root
+ * ::
+
+ struct dentry *root
A pointer to the root of the mountable tree (and indirectly, the
superblock thereof). This is filled in by the ->get_tree() op. If this
is set, an active reference on root->d_sb must also be held.
- (*) struct user_namespace *user_ns
- (*) struct net *net_ns
+ * ::
+
+ struct user_namespace *user_ns
+ struct net *net_ns
These are a subset of the namespaces in use by the invoking process. They
retain references on each namespace. The subscribed namespaces may be
replaced by the filesystem to reflect other sources, such as the parent
mount superblock on an automount.
- (*) const struct cred *cred
+ * ::
+
+ const struct cred *cred
The mounter's credentials. This retains a reference on the credentials.
- (*) char *source
+ * ::
+
+ char *source
This specifies the source. It may be a block device (e.g. /dev/sda1) or
something more exotic, such as the "host:/path" that NFS desires.
- (*) char *subtype
+ * ::
+
+ char *subtype
This is a string to be added to the type displayed in /proc/mounts to
qualify it (used by FUSE). This is available for the filesystem to set if
desired.
- (*) void *security
+ * ::
+
+ void *security
A place for the LSMs to hang their security data for the superblock. The
relevant security operations are described below.
- (*) void *s_fs_info
+ * ::
+
+ void *s_fs_info
The proposed s_fs_info for a new superblock, set in the superblock by
sget_fc(). This can be used to distinguish superblocks.
- (*) unsigned int sb_flags
- (*) unsigned int sb_flags_mask
+ * ::
+
+ unsigned int sb_flags
+ unsigned int sb_flags_mask
Which SB_* flag bits are to be set/cleared in super_block::s_flags.
- (*) unsigned int s_iflags
+ * ::
+
+ unsigned int s_iflags
These will be bitwise-OR'd with s->s_iflags when a superblock is created.
- (*) enum fs_context_purpose
+ * ::
+
+ enum fs_context_purpose
This indicates the purpose for which the context is intended. The
available values are:
- FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount
- FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
- FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+	========================== ======================================
+	FS_CONTEXT_FOR_MOUNT       New superblock for explicit mount
+	FS_CONTEXT_FOR_SUBMOUNT    New automatic submount of extant mount
+	FS_CONTEXT_FOR_RECONFIGURE Change an existing mount
+	========================== ======================================
The mount context is created by calling one of the fs_context_for_*() helpers
or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
@@ -176,17 +203,16 @@ mount context. For instance, NFS might pin the appropriate protocol version
module.
-=================================
-THE FILESYSTEM CONTEXT OPERATIONS
+The Filesystem Context Operations
=================================
-The filesystem context points to a table of operations:
+The filesystem context points to a table of operations::
struct fs_context_operations {
void (*free)(struct fs_context *fc);
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ struct fs_parameter *param);
int (*parse_monolithic)(struct fs_context *fc, void *data);
int (*get_tree)(struct fs_context *fc);
int (*reconfigure)(struct fs_context *fc);
@@ -195,24 +221,32 @@ The filesystem context points to a table of operations:
These operations are invoked by the various stages of the mount procedure to
manage the filesystem context. They are as follows:
- (*) void (*free)(struct fs_context *fc);
+ * ::
+
+ void (*free)(struct fs_context *fc);
Called to clean up the filesystem-specific part of the filesystem context
when the context is destroyed. It should be aware that parts of the
context may have been removed and NULL'd out by ->get_tree().
- (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ * ::
+
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
Called when a filesystem context has been duplicated to duplicate the
filesystem-private data. An error may be returned to indicate failure to
do this.
- [!] Note that even if this fails, put_fs_context() will be called
+ .. Warning::
+
+ Note that even if this fails, put_fs_context() will be called
immediately thereafter, so ->dup() *must* make the
filesystem-private data safe for ->free().
- (*) int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ * ::
+
+ int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
Called when a parameter is being added to the filesystem context. param
points to the key name and maybe a value object. VFS-specific options
@@ -224,7 +258,9 @@ manage the filesystem context. They are as follows:
If successful, 0 should be returned or a negative error code otherwise.
- (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+ * ::
+
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
Called when the mount(2) system call is invoked to pass the entire data
page in one go. If this is expected to be just a list of "key[=val]"
@@ -236,7 +272,9 @@ manage the filesystem context. They are as follows:
finds it's the standard key-val list then it may pass it off to
generic_parse_monolithic().
- (*) int (*get_tree)(struct fs_context *fc);
+ * ::
+
+ int (*get_tree)(struct fs_context *fc);
Called to get or create the mountable root and superblock, using the
information stored in the filesystem context (reconfiguration goes via a
@@ -249,7 +287,9 @@ manage the filesystem context. They are as follows:
The phase on a userspace-driven context will be set to only allow this to
be called once on any particular context.
- (*) int (*reconfigure)(struct fs_context *fc);
+ * ::
+
+ int (*reconfigure)(struct fs_context *fc);
Called to effect reconfiguration of a superblock using information stored
in the filesystem context. It may detach any resources it desires from
@@ -259,19 +299,20 @@ manage the filesystem context. They are as follows:
On success it should return 0. In the case of an error, it should return
a negative error code.
- [NOTE] reconfigure is intended as a replacement for remount_fs.
+ .. Note:: reconfigure is intended as a replacement for remount_fs.
-===========================
-FILESYSTEM CONTEXT SECURITY
+Filesystem Context Security
===========================
The filesystem context contains a security pointer that the LSMs can use for
building up a security context for the superblock to be mounted. There are a
number of operations used by the new mount code for this purpose:
- (*) int security_fs_context_alloc(struct fs_context *fc,
- struct dentry *reference);
+ * ::
+
+ int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference);
Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. It should return 0 on success or a negative error
@@ -283,22 +324,28 @@ number of operations used by the new mount code for this purpose:
non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
it indicates the automount point.
- (*) int security_fs_context_dup(struct fs_context *fc,
- struct fs_context *src_fc);
+ * ::
+
+ int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. The original filesystem context is pointed to by
src_fc and may be used for reference. It should return 0 on success or a
negative error code on failure.
- (*) void security_fs_context_free(struct fs_context *fc);
+ * ::
+
+ void security_fs_context_free(struct fs_context *fc);
Called to clean up anything attached to fc->security. Note that the
contents may have been transferred to a superblock and the pointer cleared
during get_tree.
- (*) int security_fs_context_parse_param(struct fs_context *fc,
- struct fs_parameter *param);
+ * ::
+
+ int security_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param);
Called for each mount parameter, including the source. The arguments are
as for the ->parse_param() method. It should return 0 to indicate that
@@ -310,7 +357,9 @@ number of operations used by the new mount code for this purpose:
(provided the value pointer is NULL'd out). If it is stolen, 1 must be
returned to prevent it being passed to the filesystem.
- (*) int security_fs_context_validate(struct fs_context *fc);
+ * ::
+
+ int security_fs_context_validate(struct fs_context *fc);
Called after all the options have been parsed to validate the collection
as a whole and to do any necessary allocation so that
@@ -320,36 +369,43 @@ number of operations used by the new mount code for this purpose:
In the case of reconfiguration, the target superblock will be accessible
via fc->root.
- (*) int security_sb_get_tree(struct fs_context *fc);
+ * ::
+
+ int security_sb_get_tree(struct fs_context *fc);
Called during the mount procedure to verify that the specified superblock
is allowed to be mounted and to transfer the security data there. It
should return 0 or a negative error code.
- (*) void security_sb_reconfigure(struct fs_context *fc);
+ * ::
+
+ void security_sb_reconfigure(struct fs_context *fc);
Called to apply any reconfiguration to an LSM's context. It must not
fail. Error checking and resource allocation must be done in advance by
the parameter parsing and validation hooks.
- (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
- unsigned int mnt_flags);
+ * ::
+
+ int security_sb_mountpoint(struct fs_context *fc,
+ struct path *mountpoint,
+ unsigned int mnt_flags);
Called during the mount procedure to verify that the root dentry attached
to the context is permitted to be attached to the specified mountpoint.
It should return 0 on success or a negative error code on failure.
-==========================
-VFS FILESYSTEM CONTEXT API
+VFS Filesystem Context API
==========================
There are four operations for creating a filesystem context and one for
destroying a context:
- (*) struct fs_context *fs_context_for_mount(
- struct file_system_type *fs_type,
- unsigned int sb_flags);
+ * ::
+
+ struct fs_context *fs_context_for_mount(struct file_system_type *fs_type,
+ unsigned int sb_flags);
Allocate a filesystem context for the purpose of setting up a new mount,
whether that be with a new superblock or sharing an existing one. This
@@ -359,7 +415,9 @@ destroying a context:
fs_type specifies the filesystem type that will manage the context and
sb_flags presets the superblock flags stored therein.
- (*) struct fs_context *fs_context_for_reconfigure(
+ * ::
+
+ struct fs_context *fs_context_for_reconfigure(
struct dentry *dentry,
unsigned int sb_flags,
unsigned int sb_flags_mask);
@@ -369,7 +427,9 @@ destroying a context:
configured. sb_flags and sb_flags_mask indicate which superblock flags
need changing and to what.
- (*) struct fs_context *fs_context_for_submount(
+ * ::
+
+ struct fs_context *fs_context_for_submount(
struct file_system_type *fs_type,
struct dentry *reference);
@@ -382,7 +442,9 @@ destroying a context:
Note that it's not a requirement that the reference dentry be of the same
filesystem type as fs_type.
- (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+ * ::
+
+ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
Duplicate a filesystem context, copying any options noted and duplicating
or additionally referencing any resources held therein. This is available
@@ -392,14 +454,18 @@ destroying a context:
The purpose in the new context is inherited from the old one.
- (*) void put_fs_context(struct fs_context *fc);
+ * ::
+
+ void put_fs_context(struct fs_context *fc);
Destroy a filesystem context, releasing any resources it holds. This
calls the ->free() operation. This is intended to be called by anyone who
created a filesystem context.
- [!] filesystem contexts are not refcounted, so this causes unconditional
- destruction.
+ .. Warning::
+
+ filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
In all the above operations, apart from the put op, the return is a mount
context pointer or a negative error code.
@@ -407,10 +473,12 @@ context pointer or a negative error code.
For the remaining operations, if an error occurs, a negative error code will be
returned.
- (*) int vfs_parse_fs_param(struct fs_context *fc,
- struct fs_parameter *param);
+ * ::
+
+ int vfs_parse_fs_param(struct fs_context *fc,
+ struct fs_parameter *param);
- Supply a single mount parameter to the filesystem context. This include
+ Supply a single mount parameter to the filesystem context. This includes
the specification of the source/device which is specified as the "source"
parameter (which may be specified multiple times if the filesystem
supports that).
@@ -423,53 +491,64 @@ returned.
The parameter value is typed and can be one of:
- fs_value_is_flag, Parameter not given a value.
- fs_value_is_string, Value is a string
- fs_value_is_blob, Value is a binary blob
- fs_value_is_filename, Value is a filename* + dirfd
- fs_value_is_file, Value is an open file (file*)
+ ==================== =============================
+ fs_value_is_flag Parameter not given a value
+ fs_value_is_string Value is a string
+ fs_value_is_blob Value is a binary blob
+ fs_value_is_filename Value is a filename* + dirfd
+ fs_value_is_file Value is an open file (file*)
+ ==================== =============================
If there is a value, that value is stored in a union in the struct in one
of param->{string,blob,name,file}. Note that the function may steal and
clear the pointer, but then becomes responsible for disposing of the
object.
- (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
- const char *value, size_t v_size);
+ * ::
+
+ int vfs_parse_fs_string(struct fs_context *fc, const char *key,
+ const char *value, size_t v_size);
A wrapper around vfs_parse_fs_param() that copies the value string it is
passed.
- (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+ * ::
+
+ int generic_parse_monolithic(struct fs_context *fc, void *data);
Parse a sys_mount() data page, assuming the form to be a text list
consisting of key[=val] options separated by commas. Each item in the
list is passed to vfs_mount_option(). This is the default when the
->parse_monolithic() method is NULL.
- (*) int vfs_get_tree(struct fs_context *fc);
+ * ::
+
+ int vfs_get_tree(struct fs_context *fc);
Get or create the mountable root and superblock, using the parameters in
the filesystem context to select/configure the superblock. This invokes
the ->get_tree() method.
- (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+ * ::
+
+ struct vfsmount *vfs_create_mount(struct fs_context *fc);
Create a mount given the parameters in the specified filesystem context.
Note that this does not attach the mount to anything.
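+
+Taken together, an in-kernel mount might be assembled like this (a sketch;
+``my_fs_type`` and the "source" value are hypothetical, and error handling is
+abbreviated)::
+
+	static struct vfsmount *my_kern_mount(void)
+	{
+		struct fs_context *fc;
+		struct vfsmount *mnt;
+		int err;
+
+		fc = fs_context_for_mount(&my_fs_type, 0);
+		if (IS_ERR(fc))
+			return ERR_CAST(fc);
+
+		err = vfs_parse_fs_string(fc, "source", "/dev/sda1", 9);
+		if (!err)
+			err = vfs_get_tree(fc);
+		mnt = err ? ERR_PTR(err) : vfs_create_mount(fc);
+		put_fs_context(fc);
+		return mnt;
+	}
+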
-===========================
-SUPERBLOCK CREATION HELPERS
+Superblock Creation Helpers
===========================
A number of VFS helpers are available for use by filesystems for the creation
or looking up of superblocks.
- (*) struct super_block *
- sget_fc(struct fs_context *fc,
- int (*test)(struct super_block *sb, struct fs_context *fc),
- int (*set)(struct super_block *sb, struct fs_context *fc));
+ * ::
+
+ struct super_block *
+ sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *sb, struct fs_context *fc),
+ int (*set)(struct super_block *sb, struct fs_context *fc));
This is the core routine. If test is non-NULL, it searches for an
existing superblock matching the criteria held in the fs_context, using
@@ -482,15 +561,6 @@ or looking up of superblocks.
The following helpers all wrap sget_fc():
- (*) int vfs_get_super(struct fs_context *fc,
- enum vfs_get_super_keying keying,
- int (*fill_super)(struct super_block *sb,
- struct fs_context *fc))
-
- This creates/looks up a deviceless superblock. The keying indicates how
- many superblocks of this type may exist and in what manner they may be
- shared:
-
(1) vfs_get_single_super
Only one such superblock may exist in the system. Any further
@@ -510,19 +580,18 @@ The following helpers all wrap sget_fc():
one.
-=====================
-PARAMETER DESCRIPTION
+Parameter Description
=====================
Parameters are described using structures defined in linux/fs_parser.h.
-There's a core description struct that links everything together:
+There's a core description struct that links everything together::
struct fs_parameter_description {
const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums;
};
-For example:
+For example::
enum {
Opt_autocell,
@@ -539,10 +608,12 @@ For example:
The members are as follows:
- (1) const struct fs_parameter_specification *specs;
+ (1) ::
+
+ const struct fs_parameter_specification *specs;
Table of parameter specifications, terminated with a null entry, where the
- entries are of type:
+ entries are of type::
struct fs_parameter_spec {
const char *name;
@@ -558,6 +629,7 @@ The members are as follows:
The 'type' field indicates the desired value type and must be one of:
+ ======================= ======================= =====================
TYPE NAME EXPECTED VALUE RESULT IN
======================= ======================= =====================
fs_param_is_flag No value n/a
@@ -573,19 +645,23 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ ======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
to match any string value against "0", "1", "no", "yes", "false", "true".
Each parameter can also be qualified with 'flags':
+ ======================= ================================================
fs_param_v_optional The value is optional
fs_param_neg_with_no result->negated set if key is prefixed with "no"
fs_param_neg_with_empty result->negated set if value is ""
fs_param_deprecated The parameter is deprecated.
+ ======================= ================================================
These are wrapped with a number of convenience wrappers:
+ ======================= ===============================================
MACRO SPECIFIES
======================= ===============================================
fsparam_flag() fs_param_is_flag
@@ -602,9 +678,10 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ ======================= ===============================================
all of which take two arguments, a name string and an option number - for
- example:
+ example::
static const struct fs_parameter_spec afs_param_specs[] = {
fsparam_flag ("autocell", Opt_autocell),
@@ -618,10 +695,12 @@ The members are as follows:
of arguments to specify the type and the flags for anything that doesn't
match one of the above macros.
- (2) const struct fs_parameter_enum *enums;
+ (2) ::
+
+ const struct fs_parameter_enum *enums;
Table of enum value names to integer mappings, terminated with a null
- entry. This is of type:
+ entry. This is of type::
struct fs_parameter_enum {
u8 opt;
@@ -630,7 +709,7 @@ The members are as follows:
};
Where the array is an unsorted list of { parameter ID, name }-keyed
- elements that indicate the value to map to, e.g.:
+ elements that indicate the value to map to, e.g.::
static const struct fs_parameter_enum afs_param_enums[] = {
{ Opt_bar, "x", 1},
@@ -648,18 +727,19 @@ CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
userspace using the fsinfo() syscall.
-==========================
-PARAMETER HELPER FUNCTIONS
+Parameter Helper Functions
==========================
A number of helper functions are provided to help a filesystem or an LSM
process the parameters it is given.
- (*) int lookup_constant(const struct constant_table tbl[],
- const char *name, int not_found);
+ * ::
+
+ int lookup_constant(const struct constant_table tbl[],
+ const char *name, int not_found);
Look up a constant by name in a table of name -> integer mappings. The
- table is an array of elements of the following type:
+ table is an array of elements of the following type::
struct constant_table {
const char *name;
@@ -669,9 +749,11 @@ process the parameters it is given.
If a match is found, the corresponding value is returned. If a match
isn't found, the not_found value is returned instead.
- (*) bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
+ * ::
+
+ bool validate_constant_table(const struct constant_table *tbl,
+ size_t tbl_size,
+ int low, int high, int special);
Validate a constant table. Checks that all the elements are appropriately
ordered, that there are no duplicates and that the values are between low
@@ -680,18 +762,22 @@ process the parameters it is given.
should just be set to lie inside the low-to-high range.
If all is good, true is returned. If the table is invalid, errors are
- logged to dmesg and false is returned.
+ logged to the kernel log buffer and false is returned.
- (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+ * ::
+
+ bool fs_validate_description(const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
- log errors to dmesg if validation fails.
+ log errors to the kernel log buffer if validation fails.
+
+ * ::
- (*) int fs_parse(struct fs_context *fc,
- const struct fs_parameter_description *desc,
- struct fs_parameter *param,
- struct fs_parse_result *result);
+ int fs_parse(struct fs_context *fc,
+ const struct fs_parameter_description *desc,
+ struct fs_parameter *param,
+ struct fs_parse_result *result);
This is the main interpreter of parameters. It uses the parameter
description to look up a parameter by key name and to convert that to an
@@ -711,14 +797,17 @@ process the parameters it is given.
parameter is matched, but the value is erroneous, -EINVAL will be
returned; otherwise the parameter's option number will be returned.
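+
+For example, a filesystem's ->parse_param() operation might be built around
+fs_parse() like this (a sketch; ``my_fs_parameters``, ``Opt_autocell`` and the
+use made of fc->fs_private are hypothetical)::
+
+	static int my_parse_param(struct fs_context *fc,
+				  struct fs_parameter *param)
+	{
+		struct fs_parse_result result;
+		int opt;
+
+		opt = fs_parse(fc, &my_fs_parameters, param, &result);
+		if (opt < 0)
+			return opt;	/* -ENOPARAM, -EINVAL, ... */
+
+		switch (opt) {
+		case Opt_autocell:
+			/* ... record the option in fc->fs_private ... */
+			break;
+		}
+		return 0;
+	}
+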
- (*) int fs_lookup_param(struct fs_context *fc,
- struct fs_parameter *value,
- bool want_bdev,
- struct path *_path);
+ * ::
+
+ int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *value,
+ bool want_bdev,
+ unsigned int flags,
+ struct path *_path);
This takes a parameter that carries a string or filename type and attempts
to do a path lookup on it. If the parameter expects a blockdev, a check
is made that the inode actually represents one.
- Returns 0 if successful and *_path will be set; returns a negative error
- code if not.
+ Returns 0 if successful and ``*_path`` will be set; returns a negative
+ error code if not.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
new file mode 100644
index 000000000000..4cc657d743f7
--- /dev/null
+++ b/Documentation/filesystems/netfs_library.rst
@@ -0,0 +1,595 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+Network Filesystem Helper Library
+=================================
+
+.. Contents:
+
+ - Overview.
+ - Per-inode context.
+ - Inode context helper functions.
+ - Buffered read helpers.
+ - Read helper functions.
+ - Read helper structures.
+ - Read helper operations.
+ - Read helper procedure.
+ - Read helper cache API.
+
+
+Overview
+========
+
+The network filesystem helper library is a set of functions designed to aid a
+network filesystem in implementing VM/VFS operations. For the moment, that
+just includes turning various VM buffered read operations into requests to read
+from the server. The helper library, however, can also interpose other
+services, such as local caching or local data encryption.
+
+Note that the library module doesn't link against local caching directly, so
+access must be provided by the netfs.
+
+
+Per-Inode Context
+=================
+
+The network filesystem helper library needs a place to store a bit of state for
+its use on each netfs inode it is helping to manage. To this end, a context
+structure is defined::
+
+ struct netfs_inode {
+ struct inode inode;
+ const struct netfs_request_ops *ops;
+ struct fscache_cookie *cache;
+ };
+
+A network filesystem that wants to use netfs lib must place one of these in its
+inode wrapper struct instead of the VFS ``struct inode``. This can be done in
+a way similar to the following::
+
+ struct my_inode {
+ struct netfs_inode netfs; /* Netfslib context and vfs inode */
+ ...
+ };
+
+This allows netfslib to find its state by using ``container_of()`` from the
+inode pointer, thereby allowing the netfslib helper functions to be pointed to
+directly by the VFS/VM operation tables.
+
+The structure contains the following fields:
+
+ * ``inode``
+
+ The VFS inode structure.
+
+ * ``ops``
+
+ The set of operations provided by the network filesystem to netfslib.
+
+ * ``cache``
+
+ Local caching cookie, or NULL if no caching is enabled. This field does not
+ exist if fscache is disabled.
+
+
+Inode Context Helper Functions
+------------------------------
+
+To help deal with the per-inode context, a number of helper functions are
+provided. Firstly, a function to perform basic initialisation on a context and
+set the operations table pointer::
+
+ void netfs_inode_init(struct netfs_inode *ctx,
+ const struct netfs_request_ops *ops);
+
+then a function to cast from the VFS inode structure to the netfs context::
+
+	struct netfs_inode *netfs_inode(struct inode *inode);
+
+and finally, a function to get the cache cookie pointer from the context
+attached to an inode (or NULL if fscache is disabled)::
+
+ struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
+
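+For instance, the context might be initialised when the filesystem sets up a
+new inode (a sketch; ``my_set_up_inode()`` and ``my_req_ops`` are
+hypothetical, and ``struct my_inode`` is the wrapper from above)::
+
+	static void my_set_up_inode(struct inode *inode)
+	{
+		struct my_inode *mi =
+			container_of(inode, struct my_inode, netfs.inode);
+
+		netfs_inode_init(&mi->netfs, &my_req_ops);
+	}
+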
+
+Buffered Read Helpers
+=====================
+
+The library provides a set of read helpers that handle the ->read_folio(),
+->readahead() and much of the ->write_begin() VM operations and translate them
+into a common call framework.
+
+The following services are provided:
+
+ * Handle folios that span multiple pages.
+
+ * Insulate the netfs from VM interface changes.
+
+ * Allow the netfs to arbitrarily split reads up into pieces, even ones that
+ don't match folio sizes or folio alignments and that may cross folios.
+
+ * Allow the netfs to expand a readahead request in both directions to meet its
+ needs.
+
+ * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+
+ * Handle local caching, allowing cached data and server-read data to be
+ interleaved for a single request.
+
+ * Handle clearing of parts of the buffer that don't exist on the server.
+
+ * Handle retrying of reads that failed, switching reads from the cache to the
+ server as necessary.
+
+ * In the future, this is a place that other services can be performed, such as
+ local encryption of data to be stored remotely or in the cache.
+
+From the network filesystem, the helpers require a table of operations. This
+includes a mandatory method to issue a read operation along with a number of
+optional methods.
+
+
+Read Helper Functions
+---------------------
+
+Three read helpers are provided::
+
+ void netfs_readahead(struct readahead_control *ractl);
+ int netfs_read_folio(struct file *file,
+ struct folio *folio);
+ int netfs_write_begin(struct netfs_inode *ctx,
+ struct file *file,
+ struct address_space *mapping,
+ loff_t pos,
+ unsigned int len,
+ struct folio **_folio,
+ void **_fsdata);
+
+Each corresponds to a VM address space operation. These operations use the
+state in the per-inode context.
+
+For ->readahead() and ->read_folio(), the network filesystem can just point
+directly at the corresponding read helper; whereas for ->write_begin(), it may
+be a little more complicated as the network filesystem might want to flush
+conflicting writes or track dirty data, and needs to put the acquired folio if
+an error occurs after calling the helper.
+
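+For example, the read-side entries of an address_space_operations table might
+be wired up directly (a sketch; only these two entries are shown)::
+
+	static const struct address_space_operations my_aops = {
+		.read_folio	= netfs_read_folio,
+		.readahead	= netfs_readahead,
+		/* ... */
+	};
+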
+The helpers manage the read request, calling back into the network filesystem
+through the supplied table of operations. Waits will be performed as
+necessary before returning for helpers that are meant to be synchronous.
+
+If an error occurs, ->free_request() will be called to clean up the
+allocated netfs_io_request struct. If some parts of the request are in
+progress when an error occurs, the request will get partially completed if
+sufficient data is read.
+
+Additionally, there is::
+
+	void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
+				     ssize_t transferred_or_error,
+				     bool was_async);
+
+which should be called to complete a read subrequest. This is given the number
+of bytes transferred or a negative error code, plus a flag indicating whether
+the operation was asynchronous (ie. whether the follow-on processing can be
+done in the current context, given this may involve sleeping).
+
+
+Read Helper Structures
+----------------------
+
+The read helpers make use of a couple of structures to maintain the state of
+the read. The first is a structure that manages a read request as a whole::
+
+ struct netfs_io_request {
+ struct inode *inode;
+ struct address_space *mapping;
+ struct netfs_cache_resources cache_resources;
+ void *netfs_priv;
+ loff_t start;
+ size_t len;
+ loff_t i_size;
+ const struct netfs_request_ops *netfs_ops;
+ unsigned int debug_id;
+ ...
+ };
+
+The above fields are the ones the netfs can use. They are:
+
+ * ``inode``
+ * ``mapping``
+
+ The inode and the address space of the file being read from. The mapping
+ may or may not point to inode->i_data.
+
+ * ``cache_resources``
+
+ Resources for the local cache to use, if present.
+
+ * ``netfs_priv``
+
+ The network filesystem's private data. The value for this can be passed in
+ to the helper functions or set during the request.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of the read request and the length. These
+ may be altered by the ->expand_readahead() op.
+
+ * ``i_size``
+
+ The size of the file at the start of the request.
+
+ * ``netfs_ops``
+
+ A pointer to the operation table. The value for this is passed into the
+ helper functions.
+
+ * ``debug_id``
+
+ A number allocated to this operation that can be displayed in trace lines
+ for reference.
+
+
+The second structure is used to manage individual slices of the overall read
+request::
+
+ struct netfs_io_subrequest {
+ struct netfs_io_request *rreq;
+ loff_t start;
+ size_t len;
+ size_t transferred;
+ unsigned long flags;
+ unsigned short debug_index;
+ ...
+ };
+
+Each subrequest is expected to access a single source, though the helpers will
+handle falling back from one source type to another. The members are:
+
+ * ``rreq``
+
+ A pointer to the read request.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of this slice of the read request and the
+ length.
+
+ * ``transferred``
+
+ The amount of data transferred so far of the length of this slice. The
+ network filesystem or cache should start the operation this far into the
+    slice. If a short read occurs, the helpers will reissue the read, having
+    updated this to reflect the amount read so far.
+
+ * ``flags``
+
+ Flags pertaining to the read. There are two of interest to the filesystem
+ or cache:
+
+ * ``NETFS_SREQ_CLEAR_TAIL``
+
+ This can be set to indicate that the remainder of the slice, from
+ transferred to len, should be cleared.
+
+ * ``NETFS_SREQ_SEEK_DATA_READ``
+
+ This is a hint to the cache that it might want to try skipping ahead to
+ the next data (ie. using SEEK_DATA).
+
+ * ``debug_index``
+
+ A number allocated to this slice that can be displayed in trace lines for
+ reference.
+
+
+Read Helper Operations
+----------------------
+
+The network filesystem must provide the read helpers with a table of operations
+through which it can issue requests and negotiate::
+
+ struct netfs_request_ops {
+ void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ void (*free_request)(struct netfs_io_request *rreq);
+ void (*expand_readahead)(struct netfs_io_request *rreq);
+ bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ void (*issue_read)(struct netfs_io_subrequest *subreq);
+ bool (*is_still_valid)(struct netfs_io_request *rreq);
+ int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
+ struct folio **foliop, void **_fsdata);
+ void (*done)(struct netfs_io_request *rreq);
+ };
+
+The operations are as follows:
+
+ * ``init_request()``
+
+ [Optional] This is called to initialise the request structure. It is given
+ the file for reference.
+
+ * ``free_request()``
+
+ [Optional] This is called as the request is being deallocated so that the
+ filesystem can clean up any state it has attached there.
+
+ * ``expand_readahead()``
+
+ [Optional] This is called to allow the filesystem to expand the size of a
+ readahead read request. The filesystem gets to expand the request in both
+ directions, though it's not permitted to reduce it as the numbers may
+ represent an allocation already made. If local caching is enabled, it gets
+ to expand the request first.
+
+ Expansion is communicated by changing ->start and ->len in the request
+ structure. Note that if any change is made, ->len must be increased by at
+ least as much as ->start is reduced.
+
+ * ``clamp_length()``
+
+ [Optional] This is called to allow the filesystem to reduce the size of a
+ subrequest. The filesystem can use this, for example, to chop up a request
+ that has to be split across multiple servers or to put multiple reads in
+ flight.
+
+  This should return true on success and false on error.
+
+ * ``issue_read()``
+
+  [Required] The helpers use this to dispatch a subrequest to the server for
+  reading. In the subrequest, ->start, ->len and ->transferred indicate what
+  data should be read from the server. A sketch is given after this list.
+
+ There is no return value; the netfs_subreq_terminated() function should be
+ called to indicate whether or not the operation succeeded and how much data
+ it transferred. The filesystem also should not deal with setting folios
+ uptodate, unlocking them or dropping their refs - the helpers need to deal
+ with this as they have to coordinate with copying to the local cache.
+
+ Note that the helpers have the folios locked, but not pinned. It is
+ possible to use the ITER_XARRAY iov iterator to refer to the range of the
+ inode that is being operated upon without the need to allocate large bvec
+ tables.
+
+ * ``is_still_valid()``
+
+ [Optional] This is called to find out if the data just read from the local
+ cache is still valid. It should return true if it is still valid and false
+ if not. If it's not still valid, it will be reread from the server.
+
+ * ``check_write_begin()``
+
+ [Optional] This is called from the netfs_write_begin() helper once it has
+ allocated/grabbed the folio to be modified, so that the filesystem can
+ flush conflicting state before the modification is allowed to proceed.
+
+ It may unlock and discard the folio it was given and set the caller's folio
+ pointer to NULL. It should return 0 if everything is now fine (``*foliop``
+ left set) or if the operation should be retried (``*foliop`` cleared), and
+ any other error code to abort the operation.
+
+ * ``done``
+
+ [Optional] This is called after the folios in the request have all been
+ unlocked (and marked uptodate if applicable).
+
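+As an illustrative sketch only, a minimal network filesystem might wire up
+just the mandatory ``issue_read()`` method; ``myfs_server_read_async()`` and
+the other ``myfs_`` names here are hypothetical::
+
+	static void myfs_read_done(void *priv, ssize_t transferred_or_error)
+	{
+		struct netfs_io_subrequest *subreq = priv;
+
+		/* Report success (bytes transferred) or a negative error. */
+		netfs_subreq_terminated(subreq, transferred_or_error, false);
+	}
+
+	static void myfs_issue_read(struct netfs_io_subrequest *subreq)
+	{
+		/* Fetch the remainder of the slice from the server. */
+		myfs_server_read_async(subreq->rreq->inode,
+				       subreq->start + subreq->transferred,
+				       subreq->len - subreq->transferred,
+				       myfs_read_done, subreq);
+	}
+
+	static const struct netfs_request_ops myfs_req_ops = {
+		.issue_read = myfs_issue_read,
+	};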
+
+
+Read Helper Procedure
+---------------------
+
+The read helpers work by the following general procedure:
+
+ * Set up the request.
+
+ * For readahead, allow the local cache and then the network filesystem to
+ propose expansions to the read request. This is then proposed to the VM.
+ If the VM cannot fully perform the expansion, a partially expanded read will
+ be performed, though this may not get written to the cache in its entirety.
+
+ * Loop around slicing chunks off of the request to form subrequests:
+
+ * If a local cache is present, it gets to do the slicing; otherwise the
+ helpers just try to generate maximal slices.
+
+ * The network filesystem gets to clamp the size of each slice if it is to be
+ the source. This allows rsize and chunking to be implemented.
+
+ * The helpers issue a read from the cache, a read from the server, or just
+ clear the slice, as appropriate.
+
+ * The next slice begins at the end of the last one.
+
+ * As slices finish being read, they terminate.
+
+ * When all the subrequests have terminated, the subrequests are assessed and
+ any that are short or have failed are reissued:
+
+ * Failed cache requests are issued against the server instead.
+
+ * Failed server requests just fail.
+
+ * Short reads against either source will be reissued against that source,
+ provided they have made some progress (transferred some more data):
+
+ * The cache may need to skip holes that it can't do DIO from.
+
+ * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
+ end of the slice instead of reissuing.
+
+ * Once the data is read, the folios that have been fully read/cleared:
+
+ * Will be marked uptodate.
+
+ * If a cache is present, will be marked with PG_fscache.
+
+ * Will be unlocked.
+
+ * Any folios that need writing to the cache will then have DIO writes issued.
+
+ * Synchronous operations will wait for reading to be complete.
+
+ * Writes to the cache will proceed asynchronously and the folios will have the
+ PG_fscache mark removed when that completes.
+
+ * The request structures will be cleaned up when everything has completed.
+
+
+Read Helper Cache API
+---------------------
+
+When implementing a local cache to be used by the read helpers, two things are
+required: some way for the network filesystem to initialise the caching for a
+read request and a table of operations for the helpers to call.
+
+To begin a cache operation on an fscache object, the following function is
+called::
+
+ int fscache_begin_read_operation(struct netfs_io_request *rreq,
+ struct fscache_cookie *cookie);
+
+passing in the request pointer and the cookie corresponding to the file. This
+fills in the cache resources mentioned below.
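+
+As a sketch only, a filesystem might make this call when initialising a
+request; ``myfs_cookie()`` is a hypothetical helper returning the inode's
+``struct fscache_cookie``, and ignoring the error is a policy choice::
+
+	static void myfs_init_request(struct netfs_io_request *rreq,
+				      struct file *file)
+	{
+		/* On failure, the read simply proceeds without a cache. */
+		fscache_begin_read_operation(rreq, myfs_cookie(rreq->inode));
+	}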
+
+The netfs_io_request object contains a place for the cache to hang its
+state::
+
+ struct netfs_cache_resources {
+ const struct netfs_cache_ops *ops;
+ void *cache_priv;
+ void *cache_priv2;
+ };
+
+This contains an operations table pointer and two private pointers. The
+operation table looks like the following::
+
+ struct netfs_cache_ops {
+ void (*end_operation)(struct netfs_cache_resources *cres);
+
+ void (*expand_readahead)(struct netfs_cache_resources *cres,
+ loff_t *_start, size_t *_len, loff_t i_size);
+
+ enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
+ loff_t i_size);
+
+ int (*read)(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ bool seek_data,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+
+ int (*prepare_write)(struct netfs_cache_resources *cres,
+ loff_t *_start, size_t *_len, loff_t i_size,
+ bool no_space_allocated_yet);
+
+ int (*write)(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+
+ int (*query_occupancy)(struct netfs_cache_resources *cres,
+ loff_t start, size_t len, size_t granularity,
+ loff_t *_data_start, size_t *_data_len);
+ };
+
+With a termination handler function pointer::
+
+ typedef void (*netfs_io_terminated_t)(void *priv,
+ ssize_t transferred_or_error,
+ bool was_async);
+
+The methods defined in the table are:
+
+ * ``end_operation()``
+
+ [Required] Called to clean up the resources at the end of the read request.
+
+ * ``expand_readahead()``
+
+ [Optional] Called at the beginning of a netfs_readahead() operation to allow
+ the cache to expand a request in either direction. This allows the cache to
+ size the request appropriately for the cache granularity.
+
+ The function is passed pointers to the start and length in its parameters,
+ plus the size of the file for reference, and adjusts the start and length
+ appropriately. Note that, per the void return type shown above, the
+ expansion is communicated solely through those pointers.
+
+ * ``prepare_read()``
+
+ [Required] Called to configure the next slice of a request. ->start and
+ ->len in the subrequest indicate where and how big the next slice can be;
+ the cache gets to reduce the length to match its granularity requirements.
+ It should return one of:
+
+ * ``NETFS_FILL_WITH_ZEROES``
+ * ``NETFS_DOWNLOAD_FROM_SERVER``
+ * ``NETFS_READ_FROM_CACHE``
+ * ``NETFS_INVALID_READ``
+
+ to indicate whether the slice should just be cleared, whether it should be
+ downloaded from the server or read from the cache, or whether slicing
+ should be given up at the current point.
+
+ * ``read()``
+
+ [Required] Called to read from the cache. The start file offset is given
+ along with an iterator to read to, which gives the length also. It can be
+ given a hint requesting that it seek forward from that start position for
+ data.
+
+ Also provided is a pointer to a termination handler function and private
+ data to pass to that function. The termination function should be called
+ with the number of bytes transferred or an error code, plus a flag
+ indicating whether the termination is definitely happening in the caller's
+ context.
+
+ * ``prepare_write()``
+
+ [Required] Called to prepare a write to the cache to take place. This
+ involves checking to see whether the cache has sufficient space to honour
+ the write. ``*_start`` and ``*_len`` indicate the region to be written; the
+ region can be shrunk or it can be expanded to a page boundary either way as
+ necessary to align for direct I/O. i_size holds the size of the object and
+ is provided for reference. no_space_allocated_yet is set to true if the
+ caller is certain that no data has been written to that region - for example
+ if it tried to do a read from there already.
+
+ * ``write()``
+
+ [Required] Called to write to the cache. The start file offset is given
+ along with an iterator to write from, which gives the length also.
+
+ Also provided is a pointer to a termination handler function and private
+ data to pass to that function. The termination function should be called
+ with the number of bytes transferred or an error code, plus a flag
+ indicating whether the termination is definitely happening in the caller's
+ context.
+
+ * ``query_occupancy()``
+
+ [Required] Called to find out where the next piece of data is within a
+ particular region of the cache. The start and length of the region to be
+ queried are passed in, along with the granularity to which the answer needs
+ to be aligned. The function passes back the start and length of the data,
+ if any, available within that region. Note that there may be a hole at the
+ front.
+
+ It returns 0 if some data was found, -ENODATA if there was no usable data
+ within the region or -ENOBUFS if there is no caching on this file.
+
+Note that these methods are passed a pointer to the cache resource structure,
+not the read request structure, as they could also be used in situations
+where there isn't a read request structure, such as writing dirty data to
+the cache.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/netfs.h
+.. kernel-doc:: fs/netfs/buffered_read.c
+.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
new file mode 100644
index 000000000000..4804441155f5
--- /dev/null
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+NFSv4 client identifier
+=======================
+
+This document explains how the NFSv4 protocol identifies client
+instances in order to maintain file open and lock state during
+system restarts. A special identifier and principal are maintained
+on each client. These can be set by administrators, scripts
+provided by site administrators, or tools provided by Linux
+distributors.
+
+There are risks if a client's NFSv4 identifier and its principal
+are not chosen carefully.
+
+
+Introduction
+------------
+
+The NFSv4 protocol uses "lease-based file locking". Leases help
+NFSv4 servers provide file lock guarantees and manage their
+resources.
+
+Simply put, an NFSv4 server creates a lease for each NFSv4 client.
+The server collects each client's file open and lock state under
+the lease for that client.
+
+The client is responsible for periodically renewing its leases.
+While a lease remains valid, the server holding that lease
+guarantees the file locks the client has created remain in place.
+
+If a client stops renewing its lease (for example, if it crashes),
+the NFSv4 protocol allows the server to remove the client's open
+and lock state after a certain period of time. When a client
+restarts, it indicates to servers that open and lock state
+associated with its previous leases is no longer valid and can be
+destroyed immediately.
+
+In addition, each NFSv4 server manages a persistent list of client
+leases. When the server restarts and clients attempt to recover
+their state, the server uses this list to distinguish between
+clients that held state before the server restarted and clients
+sending fresh OPEN and LOCK requests. This enables file locks to
+persist safely across server restarts.
+
+NFSv4 client identifiers
+------------------------
+
+Each NFSv4 client presents an identifier to NFSv4 servers so that
+they can associate the client with its lease. Each client's
+identifier consists of two elements:
+
+ - co_ownerid: An arbitrary but fixed string.
+
+ - boot verifier: A 64-bit incarnation verifier that enables a
+ server to distinguish successive boot epochs of the same client.
+
+The NFSv4.0 specification refers to these two items as an
+"nfs_client_id4". The NFSv4.1 specification refers to these two
+items as a "client_owner4".
+
+NFSv4 servers tie this identifier to the principal and security
+flavor that the client used when presenting it. Servers use this
+principal to authorize subsequent lease modification operations
+sent by the client. Effectively this principal is a third element of
+the identifier.
+
+As part of the identity presented to servers, a good
+"co_ownerid" string has several important properties:
+
+ - The "co_ownerid" string identifies the client during reboot
+ recovery, therefore the string is persistent across client
+ reboots.
+ - The "co_ownerid" string helps servers distinguish the client
+ from others, therefore the string is globally unique. Note
+ that there is no central authority that assigns "co_ownerid"
+ strings.
+ - Because it often appears on the network in the clear, the
+ "co_ownerid" string does not reveal private information about
+ the client itself.
+ - The content of the "co_ownerid" string is set and unchanging
+ before the client attempts NFSv4 mounts after a restart.
+ - The NFSv4 protocol places a 1024-byte limit on the size of the
+ "co_ownerid" string.
+
+Protecting NFSv4 lease state
+----------------------------
+
+NFSv4 servers utilize the "client_owner4" as described above to
+assign a unique lease to each client. Under this scheme, there are
+circumstances where clients can interfere with each other. This is
+referred to as "lease stealing".
+
+If distinct clients present the same "co_ownerid" string and use
+the same principal (for example, AUTH_SYS and UID 0), a server is
+unable to tell that the clients are not the same. Each distinct
+client presents a different boot verifier, so it appears to the
+server as if there is one client that is rebooting frequently.
+Neither client can maintain open or lock state in this scenario.
+
+If distinct clients present the same "co_ownerid" string and use
+distinct principals, the server is likely to allow the first client
+to operate normally but reject subsequent clients with the same
+"co_ownerid" string.
+
+If a client's "co_ownerid" string or principal are not stable,
+state recovery after a server or client reboot is not guaranteed.
+If a client unexpectedly restarts but presents a different
+"co_ownerid" string or principal to the server, the server orphans
+the client's previous open and lock state. This blocks access to
+locked files until the server removes the orphaned state.
+
+If the server restarts and a client presents a changed "co_ownerid"
+string or principal to the server, the server will not allow the
+client to reclaim its open and lock state, and may give those locks
+to other clients in the meantime. This is referred to as "lock
+stealing".
+
+Lease stealing and lock stealing increase the potential for denial
+of service and in rare cases even data corruption.
+
+Selecting an appropriate client identifier
+------------------------------------------
+
+By default, the Linux NFSv4 client implementation constructs its
+"co_ownerid" string starting with the words "Linux NFS" followed by
+the client's UTS node name (the same node name, incidentally, that
+is used as the "machine name" in an AUTH_SYS credential). In small
+deployments, this construction is usually adequate. Often, however,
+the node name by itself is not adequately unique, and can change
+unexpectedly. Problematic situations include:
+
+ - NFS-root (diskless) clients, where the local DHCP server (or
+ equivalent) does not provide a unique host name.
+
+ - "Containers" within a single Linux host. If each container has
+ a separate network namespace, but does not use the UTS namespace
+ to provide a unique host name, then there can be multiple NFS
+ client instances with the same host name.
+
+ - Clients across multiple administrative domains that access a
+ common NFS server. If hostnames are not assigned centrally
+ then uniqueness cannot be guaranteed unless a domain name is
+ included in the hostname.
+
+Linux provides two mechanisms to add uniqueness to its "co_ownerid"
+string:
+
+ nfs.nfs4_unique_id
+ This module parameter can set an arbitrary uniquifier string
+ via the kernel command line, or when the "nfs" module is
+ loaded.
+
+ /sys/fs/nfs/net/nfs_client/identifier
+ This virtual file, available since Linux 5.3, is local to the
+ network namespace in which it is accessed and so can provide
+ distinction between network namespaces (containers) when the
+ hostname remains uniform.
+
+Note that this file is empty on namespace creation. If the
+container system has access to some sort of per-container identity
+then that uniquifier can be used. For example, a uniquifier might
+be formed at boot using the container's internal identifier:
+
+ sha256sum /etc/machine-id | awk '{print $1}' \\
+ > /sys/fs/nfs/net/nfs_client/identifier
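+
+The nfs.nfs4_unique_id module parameter can likewise be set from the kernel
+command line; for example (the uniquifier value is illustrative only)::
+
+ nfs.nfs4_unique_id=3f9a17c6-6fd2-4154-8b02-7f4b8f1e203d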
+
+Security considerations
+-----------------------
+
+The use of cryptographic security for lease management operations
+is strongly encouraged.
+
+If NFS with Kerberos is not configured, a Linux NFSv4 client uses
+AUTH_SYS and UID 0 as the principal part of its client identity.
+This configuration is not only insecure, it increases the risk of
+lease and lock stealing. However, it might be the only choice for
+client configurations that have no local persistent storage.
+"co_ownerid" string uniqueness and persistence is critical in this
+case.
+
+When a Kerberos keytab is present on a Linux NFS client, the client
+attempts to use one of the principals in that keytab when
+identifying itself to servers. The "sec=" mount option does not
+control this behavior. Alternately, a single-user client with a
+Kerberos principal can use that principal in place of the client's
+host principal.
+
+Using Kerberos for this purpose enables the client and server to
+use the same lease for operations covered by all "sec=" settings.
+Additionally, the Linux NFS client uses the RPCSEC_GSS security
+flavor with Kerberos and the integrity QOS to prevent in-transit
+modification of lease modification requests.
+
+Additional notes
+----------------
+
+The Linux NFSv4 client establishes a single lease on each NFSv4
+server it accesses. NFSv4 mounts from a Linux NFSv4 client of a
+particular server then share that lease.
+
+Once a client establishes open and lock state, the NFSv4 protocol
+enables lease state to transition to other servers, following data
+that has been migrated. This hides data migration completely from
+running applications. The Linux NFSv4 client facilitates state
+migration by presenting the same "client_owner4" to all servers it
+encounters.
+
+========
+See Also
+========
+
+ - nfs(5)
+ - kerberos(7)
+ - RFC 7530 for the NFSv4.0 specification
+ - RFC 8881 for the NFSv4.1 specification.
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 33d588a01ace..f04ce1215a03 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
- encode_fh (optional)
- Takes a dentry and creates a filehandle fragment which can later be used
- to find or create a dentry for the same object. The default
- implementation creates a filehandle fragment that encodes a 32bit inode
- and generation number for the inode encoded, and if necessary the
- same information for the parent.
+ encode_fh (mandatory)
+ Takes a dentry and creates a filehandle fragment which may later be used
+ to find or create a dentry for the same object.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
@@ -154,6 +151,11 @@ struct which has the following members:
to find potential names, and matches inode numbers to find the correct
match.
+ flags
+ Some filesystems may need to be handled differently than others. The
+ export_operations struct also includes a flags field that allows the
+ filesystem to communicate such information to nfsd. See the Export
+ Operations Flags section below for more explanation.
A filehandle fragment consists of an array of 1 or more 4byte words,
together with a one byte "type".
@@ -163,3 +165,83 @@ generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates the decode_fh how much of the filehandle is valid, and how
it should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+ EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
+ RFC 1813 recommends that servers always send weak cache consistency
+ (WCC) data to the client after each operation. The server should
+ atomically collect attributes about the inode, do an operation on it,
+ and then collect the attributes afterward. This allows the client to
+ skip issuing GETATTRs in some situations but means that the server
+ is calling vfs_getattr for almost all RPCs. On some filesystems
+ (particularly those that are clustered or networked) this is expensive
+ and atomicity is difficult to guarantee. This flag indicates to nfsd
+ that it should skip providing WCC attributes to the client in NFSv3
+ replies when doing operations on this filesystem. Consider enabling
+ this on filesystems that have an expensive ->getattr inode operation,
+ or when atomicity between pre and post operation attribute collection
+ is impossible to guarantee.
+
+ EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+ Many NFS operations deal with filehandles, which the server must then
+ vet to ensure that they live inside of an exported tree. When the
+ export consists of an entire filesystem, this is trivial. nfsd can just
+ ensure that the filehandle lives on the filesystem. When only part of a
+ filesystem is exported, however, nfsd must walk the ancestors of the
+ inode to ensure that it's within an exported subtree. This is an
+ expensive operation and not all filesystems can support it properly.
+ This flag exempts the filesystem from subtree checking and causes
+ exportfs to get back an error if it tries to enable subtree checking
+ on it.
+
+ EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+ On some exportable filesystems (such as NFS) unlinking a file that
+ is still open can cause a fair bit of extra work. For instance,
+ the NFS client will do a "sillyrename" to ensure that the file
+ sticks around while it's still open. When reexporting, that open
+ file is held by nfsd so we usually end up doing a sillyrename, and
+ then immediately deleting the sillyrenamed file just afterward when
+ the link count actually goes to zero. Sometimes this delete can race
+ with other operations (for instance an rmdir of the parent directory).
+ This flag causes nfsd to close any open files for this inode _before_
+ calling into the vfs to do an unlink or a rename that would replace
+ an existing file.
+
+ EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
+ PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
+ write to one bdi (the final bdi) in order to free up writes queued
+ to another bdi (the client bdi). Such threads get a private balance
+ of dirty pages so that dirty pages for the client bdi do not impact
+ the daemon writing to the final bdi. For filesystems whose durable
+ storage is not local (such as exported NFS filesystems), this
+ constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
+ an export to disable writeback throttling.
+
+ EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
+ EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
+ cannot provide the semantics required by the "atomic" boolean in
+ NFSv4's change_info4. This boolean indicates to a client whether the
+ returned before and after change attributes were obtained atomically
+ with respect to the requested metadata operation (UNLINK,
+ OPEN/CREATE, MKDIR, etc).
+
+ EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
+ On most filesystems, inodes can remain under writeback after the
+ file is closed. NFSD relies on client activity or local flusher
+ threads to handle writeback. Certain filesystems, such as NFS, flush
+ all of an inode's dirty data on last close. Exports that behave this
+ way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
+ waiting for writeback when closing such files.
+
+ EXPORT_OP_ASYNC_LOCK - Indicates that the filesystem can handle async lock
+ requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
+ its own ->lock() functionality, as the core posix_lock_file() implementation
+ has no async lock request handling yet. For more information about how to
+ indicate an async lock request from a ->lock() file_operations struct, see
+ fs/locks.c and the comment for the function vfs_lock_file().
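+
+As an illustration, a filesystem advertises these flags through the flags
+field of its export_operations table; a minimal sketch with hypothetical
+``myfs_`` methods::
+
+	static const struct export_operations myfs_export_ops = {
+		.encode_fh	= myfs_encode_fh,
+		.fh_to_dentry	= myfs_fh_to_dentry,
+		.flags		= EXPORT_OP_NOWCC | EXPORT_OP_REMOTE_FS,
+	};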
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 65805624e39b..8536134f31fd 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -6,8 +6,11 @@ NFS
.. toctree::
:maxdepth: 1
+ client-identifier
+ exporting
pnfs
rpc-cache
rpc-server-gss
nfs41-server
knfsd-stats
+ reexport
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000000..ff9ae4a46530
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,113 @@
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS. However, this
+feature comes with a number of limitations. Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem. You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export does not propagate "fsid=", so it will not allow
+traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid= option.
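+
+For example, an /etc/exports entry for a reexported NFS mount might look
+like this (the path and uuid are illustrative)::
+
+ /reexport  *(rw,fsid=f1b2c3d4-5678-40ab-8def-1234567890ab)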
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots. Clients will lose any locks
+they held before the reboot, and further IO will result in errors.
+Closing and reopening files should clear the errors.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 | 32 bytes |
++-------+-----------+
+| NFSv3 | 64 bytes |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples. The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+| | filehandle length | after reexport |
++========+===================+================+
+| ext4: | 28 bytes | 52 bytes |
++--------+-------------------+----------------+
+| xfs: | 32 bytes | 56 bytes |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes | 64 bytes |
++--------+-------------------+----------------+
+
+These follow the X+22 rule above: ext4's 28-byte handle, for example,
+becomes 28 + 22 = 50 bytes, rounded up to 52. All will therefore fit in an
+NFSv3 or NFSv4 filehandle after reexport, but none are reexportable over
+NFSv2.
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+ - The (non-default) "subtreecheck" export option generally
+ requires another 4 to 8 bytes in the filehandle.
+ - If you export a subdirectory of a filesystem (instead of
+ exporting the filesystem root), that also usually adds 4 to 8
+ bytes.
+ - If you export over NFSv2, knfsd usually uses a shorter
+ filesystem identifier that saves 8 bytes.
+ - The root directory of an export uses a filehandle that is
+ shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server. In general, if the original server is
+something that also supports NFSv3, you're *probably* OK. Re-exporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+ - include/linux/exportfs.h:enum fid_type
+ - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+ - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+ - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
+allow you, for example, to open a file in a mode which forbids other
+read opens or write opens. The Linux client doesn't use them, and the
+server's support has always been incomplete: they are enforced only
+against other NFS users, not against processes accessing the exported
+filesystem locally. A reexport server will also not pass them along to
+the original server, so they will not be enforced between clients of
+different reexport servers.
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
index bb164eea969b..339efd75016a 100644
--- a/Documentation/filesystems/nfs/rpc-cache.rst
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -78,7 +78,7 @@ Creating a Cache
include taking references to shared objects.
void update(struct cache_head \*orig, struct cache_head \*new)
- Set the 'content' fileds in 'new' from 'orig'.
+ Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
Optional. Used to provide a /proc file that lists the
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
index 812754576845..5c1a1c58fc27 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.rst
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -10,13 +10,12 @@ purposes of authentication.)
RPCGSS is specified in a few IETF documents:
- - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
-and there is a 3rd version being proposed:
+There is a third version that we don't currently implement:
- - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
- (At draft n. 02 at the time of writing)
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
Background
==========
@@ -30,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and
depends on GSSAPI extensions that are KRB5 specific.
GSSAPI is a complex library, and implementing it completely in kernel is
-unwarranted. However GSSAPI operations are fundementally separable in 2
+unwarranted. However GSSAPI operations are fundamentally separable in 2
parts:
- initial context establishment
diff --git a/Documentation/filesystems/nilfs2.rst b/Documentation/filesystems/nilfs2.rst
index 6c49f04e9e0a..e3a5c8977f2c 100644
--- a/Documentation/filesystems/nilfs2.rst
+++ b/Documentation/filesystems/nilfs2.rst
@@ -231,7 +231,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
The logs include regular files, directory files, symbolic link files
-and several meta data files. The mata data files are the files used
+and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
the following meta data files::
diff --git a/Documentation/filesystems/ntfs3.rst b/Documentation/filesystems/ntfs3.rst
new file mode 100644
index 000000000000..2b86a9b3a6de
--- /dev/null
+++ b/Documentation/filesystems/ntfs3.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+NTFS3
+=====
+
+Summary and Features
+====================
+
+NTFS3 is a fully functional NTFS read-write driver. The driver works with
+NTFS versions up to 3.1. The file system type to use on mount is *ntfs3*.
+
+- This driver implements NTFS read/write support for normal, sparse and
+ compressed files.
+- Supports native journal replaying.
+- Supports NFS export of mounted NTFS volumes.
+- Supports extended attributes. Predefined extended attributes:
+
+ - *system.ntfs_security* gets/sets security
+
+ Descriptor: SECURITY_DESCRIPTOR_RELATIVE
+
+ - *system.ntfs_attrib* gets/sets ntfs file/dir attributes.
+
+ Note: Applied to empty files, this allows switching the type between
+ sparse (0x200), compressed (0x800) and normal.
+
+ - *system.ntfs_attrib_be* gets/sets ntfs file/dir attributes.
+
+ Same value as system.ntfs_attrib but always represented as big-endian
+ (the endianness of system.ntfs_attrib is that of the CPU).
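+
+ For example, these attributes might be inspected with getfattr (the path
+ is illustrative)::
+
+ getfattr -n system.ntfs_attrib_be -e hex /mnt/ntfs/somefile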
+
+Mount Options
+=============
+
+The list below describes mount options supported by the NTFS3 driver in
+addition to the generic ones. Every mount option can be negated by prefixing
+it with **no**. If an option is listed in this table with the **no** prefix,
+the default is the variant without **no**.
+
+.. flat-table::
+ :widths: 1 5
+ :fill-cells:
+
+ * - iocharset=name
+ - This option informs the driver how to interpret path strings and
+ translate them to Unicode and back. If this option is not set, the
+ default codepage will be used (CONFIG_NLS_DEFAULT).
+
+ Example: iocharset=utf8
+
+ * - uid=
+ - :rspan:`1` Sets the owner (uid) or owning group (gid) assigned to all
+ files and directories on the mounted volume.
+ * - gid=
+
+ * - umask=
+ - Controls the default permissions for files/directories created after
+ the NTFS volume is mounted.
+
+ * - dmask=
+ - :rspan:`1` Instead of specifying umask which applies both to files and
+ directories, fmask applies only to files and dmask only to directories.
+ * - fmask=
+
+ * - nohidden
+ - Files with the Windows-specific HIDDEN (FILE_ATTRIBUTE_HIDDEN) attribute
+ will not be shown under Linux.
+
+ * - sys_immutable
+ - Files with the Windows-specific SYSTEM (FILE_ATTRIBUTE_SYSTEM) attribute
+ will be marked as system immutable files.
+
+ * - hide_dot_files
+ - Updates the Windows-specific HIDDEN (FILE_ATTRIBUTE_HIDDEN) attribute
+ when creating and moving or renaming files. Files whose names start
+ with a dot will have the HIDDEN attribute set and files whose names
+ do not start with a dot will have it unset.
+
+ * - windows_names
+ - Prevents the creation of files and directories with a name not allowed
+ by Windows, either because it contains a disallowed character (one of
+ " * / : < > ? \\ | or any character whose code is less than 0x20),
+ because the name (with or without extension) is a reserved file name
+ (CON, AUX, NUL, PRN, LPT1-9, COM1-9), or because the last character is
+ a space or a dot. Such existing files can still be read and renamed.
+
+ * - discard
+ - Enable support of the TRIM command for improved performance on delete
+ operations; recommended for use with solid-state drives (SSDs).
+
+ * - force
+ - Forces the driver to mount partitions even if the volume is marked
+ dirty. Not recommended for use.
+
+ * - sparse
+ - Create new files as sparse.
+
+ * - showmeta
+ - Use this parameter to show all meta-files (System Files) on a mounted
+ NTFS partition. By default, all meta-files are hidden.
+
+ * - prealloc
+ - Preallocate extra space for files while their size is increasing due to
+ writes. This decreases fragmentation in case of parallel write
+ operations to different files.
+
+ * - acl
+ - Support POSIX ACLs (Access Control Lists). Effective only if supported
+ by the kernel. Not to be confused with NTFS ACLs.
+
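+For example, a volume might be mounted with a few of these options (the
+device and mount point are illustrative)::
+
+ mount -t ntfs3 -o iocharset=utf8,windows_names,discard /dev/sdb1 /mnt/data
+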
+Todo list
+=========
+- Full journaling support over JBD. Currently, journal replaying is
+ supported, which is not necessarily as effective as JBD would be.
+
+References
+==========
+- Commercial version of the NTFS driver for Linux.
+ https://www.paragon-software.com/home/ntfs-linux-professional/
+
+- Direct e-mail address for feedback and requests on the NTFS3 implementation.
+ almaz.alexandrovich@paragon-software.com
diff --git a/Documentation/filesystems/ocfs2.rst b/Documentation/filesystems/ocfs2.rst
index 412386bc6506..5827062995cb 100644
--- a/Documentation/filesystems/ocfs2.rst
+++ b/Documentation/filesystems/ocfs2.rst
@@ -14,7 +14,7 @@ get "mount.ocfs2" and "ocfs2_hb_ctl".
Project web page: http://ocfs2.wiki.kernel.org
Tools git tree: https://github.com/markfasheh/ocfs2-tools
-OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html
All code copyright 2005 Oracle except when otherwise noted.
diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst
index 4c8bb3074169..a104c25b7a2f 100644
--- a/Documentation/filesystems/omfs.rst
+++ b/Documentation/filesystems/omfs.rst
@@ -24,7 +24,7 @@ More information is available at:
Various utilities, including mkomfs and omfsck, are included with
omfsprogs, available at:
- http://bobcopeland.com/karma/
+ https://bobcopeland.com/karma/
Instructions are included in its README.
diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst
index e41369709c5b..931159e61796 100644
--- a/Documentation/filesystems/orangefs.rst
+++ b/Documentation/filesystems/orangefs.rst
@@ -119,9 +119,7 @@ it comes to that question::
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
-Create an /etc/pvfs2tab file::
-
-Localhost is fine for your pvfs2tab file:
+Create an /etc/pvfs2tab file (localhost is fine)::
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
/etc/pvfs2tab
@@ -276,7 +274,7 @@ then contains:
of kcalloced memory. This memory is used as an array of pointers
to each of the pages in the IO buffer through a call to get_user_pages.
* desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
- bytes of kcalloced memory. This memory is further intialized:
+ bytes of kcalloced memory. This memory is further initialized:
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
structure. user_desc->ptr points to the IO buffer.
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index c9d2bf96b02d..165514401441 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -39,18 +39,18 @@ objects in the original filesystem.
On 64bit systems, even if all overlay layers are not on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
-identifier from the real object st_ino and an underlying fsid index.
-
-If all underlying filesystems support NFS file handles and export file
-handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
-will use the high inode number bits for fsid. Even when the underlying
-filesystem uses 64bit inode numbers, users can still enable the "xino"
-feature with the "-o xino=on" overlay mount option. That is useful for the
-case of underlying filesystems like xfs and tmpfs, which use 64bit inode
-numbers, but are very unlikely to use the high inode number bits. In case
+identifier from the real object st_ino and an underlying fsid number.
+The "xino" feature uses the high inode number bits for fsid, because the
+underlying filesystems rarely use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
filesystem will fall back to the non xino behavior for that inode.
+The "xino" feature can be enabled with the "-o xino=on" overlay mount option.
+If all underlying filesystems support NFS file handles, the value of st_ino
+for overlay filesystem objects is not only unique, but also persistent over
+the lifetime of the filesystem. The "-o xino=auto" overlay mount option
+enables the "xino" feature only if the persistent st_ino requirement is met.
+
The following table summarizes what can be expected in different overlay
configurations.
@@ -66,14 +66,13 @@ Inode properties
| All layers | Y | Y | Y | Y | Y | Y | Y | Y |
| on same fs | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
-| Layers not | N | Y | Y | N | N | Y | N | Y |
+| Layers not | N | N | Y | N | N | Y | N | Y |
| on same fs, | | | | | | | | |
| xino=off | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto | Y | Y | Y | Y | Y | Y | Y | Y |
-| | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
-| xino=on/auto,| N | Y | Y | N | N | Y | N | Y |
+| xino=on/auto,| N | N | Y | N | N | Y | N | Y |
| ino overflow | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
@@ -81,7 +80,6 @@ Inode properties
/proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
file descriptor.
-
Upper and Lower
---------------
@@ -97,11 +95,13 @@ directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.
-The lower filesystem can be any filesystem supported by Linux and does
-not need to be writable. The lower filesystem can even be another
-overlayfs. The upper filesystem will normally be writable and if it
-is it must support the creation of trusted.* extended attributes, and
-must provide valid d_type in readdir responses, so NFS is not suitable.
+A wide range of filesystems supported by Linux can be the lower filesystem,
+but not all filesystems that are mountable by Linux have the features
+needed for OverlayFS to work. The lower filesystem does not need to be
+writable. The lower filesystem can even be another overlayfs. The upper
+filesystem will normally be writable and if it is it must support the
+creation of trusted.* and/or user.* extended attributes, and must provide
+valid d_type in readdir responses, so NFS is not suitable.
A read-only overlay of two read-only filesystems may use any
filesystem type.
@@ -118,7 +118,7 @@ Where both upper and lower objects are directories, a merged directory
is formed.
At mount time, the two directories given as mount options "lowerdir" and
-"upperdir" are combined into a merged directory:
+"upperdir" are combined into a merged directory::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
@@ -145,7 +145,9 @@ filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).
-A whiteout is created as a character device with 0/0 device number.
+A whiteout is created as a character device with 0/0 device number or
+as a zero-size regular file with the xattr "trusted.overlay.whiteout".
+
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.
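+
+For example, a whiteout hiding a lower-layer file named "foo" could be
+created by hand in the upper directory like this (paths illustrative)::
+
+ mknod upper/foo c 0 0
+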
@@ -154,6 +156,13 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
+An opaque directory should not contain any whiteouts, because they do not
+serve any purpose. A merge directory containing regular files with the xattr
+"trusted.overlay.whiteout" should additionally be marked by setting the xattr
+"trusted.overlay.opaque" to "x" on the merge directory itself.
+This is needed to avoid the overhead of checking the "trusted.overlay.whiteout"
+xattr on all entries during readdir in the common case.
+
readdir
-------
@@ -172,12 +181,12 @@ directory is being read. This is unlikely to be noticed by many
programs.
seek offsets are assigned sequentially when the directories are read.
-Thus if
+Thus if:
- - read part of a directory
- - remember an offset, and close the directory
- - re-open the directory some time later
- - seek to the remembered offset
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
@@ -195,7 +204,7 @@ handle it in two different ways:
1. return EXDEV error: this error is returned by rename(2) when trying to
move a file or directory across filesystem boundaries. Hence
- applications are usually prepared to hande this error (mv(1) for example
+ applications are usually prepared to handle this error (mv(1) for example
recursively copies the directory tree). This is the default behavior.
2. If the "redirect_dir" feature is enabled, then the directory will be
@@ -231,12 +240,11 @@ Mount options:
Redirects are enabled.
- "redirect_dir=follow":
Redirects are not created, but followed.
-- "redirect_dir=off":
- Redirects are not created and only followed if "redirect_always_follow"
- feature is enabled in the kernel/module config.
- "redirect_dir=nofollow":
- Redirects are not created and not followed (equivalent to "redirect_dir=off"
- if "redirect_always_follow" feature is not enabled).
+ Redirects are not created and not followed.
+- "redirect_dir=off":
+ If "redirect_always_follow" is enabled in the kernel/module config,
+ this "off" translates to "follow", otherwise it translates to "nofollow".
When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
@@ -291,9 +299,9 @@ Permission checking in the overlay filesystem follows these principles:
2) task creating the overlay mount MUST NOT gain additional privileges
3) non-mounting task MAY gain additional privileges through the overlay,
- compared to direct access on underlying lower or upper filesystems
+ compared to direct access on underlying lower or upper filesystems
-This is achieved by performing two permission checks on each access
+This is achieved by performing two permission checks on each access:
a) check if current task is allowed access based on local DAC (owner,
group, mode and posix acl), as well as MAC checks
@@ -312,11 +320,11 @@ to create setups where the consistency rule (1) does not hold; normally,
however, the mounting task will have sufficient privileges to perform all
operations.
-Another way to demonstrate this model is drawing parallels between
+Another way to demonstrate this model is drawing parallels between::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged
-and
+and::
cp -a /lower /upper
mount --bind /upper /merged
@@ -328,8 +336,8 @@ the time of copy (on-demand vs. up-front).
Multiple lower layers
---------------------
-Multiple lower layers can now be given using the the colon (":") as a
-separator character between the directory names. For example:
+Multiple lower layers can now be given using the colon (":") as a
+separator character between the directory names. For example::
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
@@ -340,11 +348,24 @@ The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.
+Note: directory names containing colons can be provided as lower layer by
+escaping the colons with a single backslash. For example::
+
+ mount -t overlay overlay -olowerdir=/a\:lower\:\:dir /merged
+
+Since kernel version v6.8, directory names containing colons can also
+be configured as lower layer using the "lowerdir+" mount options and the
+fsconfig syscall from the new mount API. For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/a:lower::dir", 0);
+
+In the latter case, colons in lower layer directory names will be escaped
+as octal characters (\072) when displayed in /proc/self/mountinfo.
Metadata only copy up
---------------------
-When metadata only copy up feature is enabled, overlayfs will only copy
+When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
like chown/chmod is performed. Full file will be copied up later when
file is opened for WRITE operation.
@@ -365,12 +386,104 @@ pointed by REDIRECT. This should not be possible on local system as setting
"trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers like from a pen drive.
-Note: redirect_dir={off|nofollow|follow[*]} conflicts with metacopy=on, and
-results in an error.
+Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options
+conflict with metacopy=on, and will result in an error.
[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.
+
+Data-only lower layers
+----------------------
+
+With "metacopy" feature enabled, an overlayfs regular file may be a composition
+of information from up to three different layers:
+
+ 1) metadata from a file in the upper layer
+
+ 2) st_ino and st_dev object identifier from a file in a lower layer
+
+ 3) data from a file in another lower layer (further below)
+
+The "lower data" file can be on any lower layer, except from the top most
+lower layer.
+
+Below the top most lower layer, any number of lower most layers may be defined
+as "data-only" lower layers, using double colon ("::") separators.
+A normal lower layer is not allowed to be below a data-only layer, so single
+colon separators are not allowed to the right of double colon ("::") separators.
+
+
+For example::
+
+ mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged
+
+The paths of files in the "data-only" lower layers are not visible in the
+merged overlayfs directories and the metadata and st_ino/st_dev of files
+in the "data-only" lower layers are not visible in overlayfs inodes.
+
+Only the data of the files in the "data-only" lower layers may be visible
+when a "metacopy" file in one of the lower layers above it has a "redirect"
+to the absolute path of the "lower data" file in the "data-only" lower layer.
+
+Since kernel version v6.8, "data-only" lower layers can also be added using
+the "datadir+" mount options and the fsconfig syscall from the new mount API.
+For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l2", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l3", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+
+
+fs-verity support
+-----------------
+
+During metadata copy up of a lower file, if the source file has
+fs-verity enabled and overlay verity support is enabled, then the
+digest of the lower file is added to the "trusted.overlay.metacopy"
+xattr. This is then used to verify the content of the lower file
+each time the metacopy file is opened.
+
+When a layer containing verity xattrs is used, it means that any such
+metacopy file in the upper layer is guaranteed to match the content
+that was in the lower at the time of the copy-up. If at any time
+(during a mount, after a remount, etc) such a file in the lower is
+replaced or modified in any way, access to the corresponding file in
+overlayfs will result in EIO errors (either on open, due to overlayfs
+digest check, or from a later read due to fs-verity) and a detailed
+error is printed to the kernel logs. For more details of how fs-verity
+file access works, see :ref:`Documentation/filesystems/fsverity.rst
+<accessing_verity_files>`.
+
+Verity can be used as a general robustness check to detect accidental
+changes in the overlayfs directories in use. But, with additional care
+it can also give more powerful guarantees. For example, if the upper
+layer is fully trusted (by using dm-verity or something similar), then
+an untrusted lower layer can be used to supply validated file content
+for all metacopy files. If additionally the untrusted lower
+directories are specified as "Data-only", then they can only supply
+such file content, and the entire mount can be trusted to match the
+upper layer.
+
+This feature is controlled by the "verity" mount option, which
+supports these values:
+
+- "off":
+ The metacopy digest is never generated or used. This is the
+ default if the verity option is not specified.
+- "on":
+ Whenever a metacopy files specifies an expected digest, the
+ corresponding data file must match the specified digest. When
+ generating a metacopy file the verity digest will be set in it
+ based on the source file (if it has one).
+- "require":
+ Same as "on", but additionally all metacopy files must specify a
+ digest (or EIO is returned on open). This means metadata copy up
+ will only be used if the data file has fs-verity enabled,
+ otherwise a full copy-up is used.
+
Sharing and copying layers
--------------------------
@@ -388,29 +501,53 @@ though it will not result in a crash or deadlock.
Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
-different lower layer path, is allowed, unless the "inodes index" feature
-or "metadata only copy up" feature is enabled.
+different lower layer path, is allowed, unless the "index" or "metacopy"
+features are enabled.
-With the "inodes index" feature, on the first time mount, an NFS file
+With the "index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
-"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
+"index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.
-For "metadata only copy up" feature there is no verification mechanism at
+For the "metacopy" feature, there is no verification mechanism at
mount time. So if same upper is mounted with different set of lower, mount
probably will succeed but expect the unexpected later on. So don't do it.
It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
-to a different machine. With the "inodes index" feature, trying to mount
+to a different machine. With the "index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.
+Nesting overlayfs mounts
+------------------------
+
+It is possible to use a lower directory that is stored on an overlayfs
+mount. For regular files this does not need any special care. However, files
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs, will be
+interpreted by the underlying overlayfs mount and stripped out. In order to
+allow the second overlayfs mount to see the attributes they must be escaped.
+
+Overlayfs specific xattrs are escaped by using a special prefix of
+"overlay.overlay.". So, a file with a "trusted.overlay.overlay.metacopy" xattr
+in the lower dir will be exposed as a regular file with a
+"trusted.overlay.metacopy" xattr in the overlayfs mount. This can be nested by
+repeating the prefix multiple times, as each instance only removes one prefix.
+
+A lower dir with a regular whiteout will always be handled by the overlayfs
+mount, so to support storing an effective whiteout file in an overlayfs mount an
+alternative form of whiteout is supported. This form is a regular, zero-size
+file with the "overlay.whiteout" xattr set, inside a directory with the
+"overlay.opaque" xattr set to "x" (see `whiteouts and opaque directories`_).
+These alternative whiteouts are never created by overlayfs, but can be used by
+userspace tools (like containers) that generate lower layers.
+These alternative whiteouts can be escaped using the standard xattr escape
+mechanism in order to properly nest to any depth.
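+
+For illustration, a hypothetical sketch of how such a tool might create an
+alternative whiteout (the helper name, paths and xattr value are made up)::
+
+	#include <fcntl.h>
+	#include <stdio.h>
+	#include <sys/xattr.h>
+	#include <unistd.h>
+
+	int make_whiteout(const char *dir, const char *name)
+	{
+		char path[4096];
+		int fd;
+
+		/* mark the directory as an "extended" opaque directory */
+		if (setxattr(dir, "trusted.overlay.opaque", "x", 1, 0))
+			return -1;
+		snprintf(path, sizeof(path), "%s/%s", dir, name);
+		/* a regular, zero-size file ... */
+		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
+		if (fd < 0)
+			return -1;
+		close(fd);
+		/* ... carrying the whiteout marker xattr; the value here
+		 * is arbitrary, its presence is what matters */
+		return setxattr(path, "trusted.overlay.whiteout", "y", 1, 0);
+	}
+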
Non-standard behavior
---------------------
@@ -420,17 +557,21 @@ filesystem.
This is the list of cases that overlayfs doesn't currently handle:
-a) POSIX mandates updating st_atime for reads. This is currently not
-done in the case when the file resides on a lower layer.
+ a) POSIX mandates updating st_atime for reads. This is currently not
+ done in the case when the file resides on a lower layer.
+
+ b) If a file residing on a lower layer is opened for read-only and then
+ memory mapped with MAP_SHARED, then subsequent changes to the file are not
+ reflected in the memory mapping.
-b) If a file residing on a lower layer is opened for read-only and then
-memory mapped with MAP_SHARED, then subsequent changes to the file are not
-reflected in the memory mapping.
+ c) If a file residing on a lower layer is being executed, then opening that
+ file for write or truncating the file will not be denied with ETXTBSY.
The following options allow overlayfs to act more like a standards
compliant filesystem:
-1) "redirect_dir"
+redirect_dir
+````````````
Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
@@ -438,7 +579,8 @@ the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").
-2) "inode index"
+index
+`````
Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.
@@ -447,7 +589,8 @@ If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.
-3) "xino"
+xino
+````
Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
@@ -459,7 +602,7 @@ enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
-overlay filesystem and the value of st_ino for directory objects may not be
+overlay filesystem and the value of st_ino for filesystem objects may not be
persistent and could change even while the overlay filesystem is mounted, as
summarized in the `Inode properties`_ table above.
@@ -467,14 +610,18 @@ summarized in the `Inode properties`_ table above.
Changes to underlying filesystems
---------------------------------
-Offline changes, when the overlay is not mounted, are allowed to either
-the upper or the lower trees.
-
Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.
+Offline changes, when the overlay is not mounted, are allowed to the
+upper tree. Offline changes to the lower tree are only allowed if the
+"metacopy", "index", "xino" and "redirect_dir" features
+have not been used. If the lower tree is modified and any of these
+features has been used, the behavior of the overlay is undefined,
+though it will not result in a crash or deadlock.
+
When the overlay NFS export feature is enabled, the overlay filesystem's
behavior on offline changes of the underlying lower layer is different
from the behavior when NFS export is disabled.
@@ -510,12 +657,13 @@ directory inode.
When encoding a file handle from an overlay filesystem object, the
following rules apply:
-1. For a non-upper object, encode a lower file handle from lower inode
-2. For an indexed object, encode a lower file handle from copy_up origin
-3. For a pure-upper object and for an existing non-indexed upper object,
- encode an upper file handle from upper inode
+ 1. For a non-upper object, encode a lower file handle from lower inode
+ 2. For an indexed object, encode a lower file handle from copy_up origin
+ 3. For a pure-upper object and for an existing non-indexed upper object,
+ encode an upper file handle from upper inode
The encoded overlay file handle includes:
+
- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode
@@ -525,15 +673,15 @@ are stored in extended attribute "trusted.overlay.origin".
When decoding an overlay file handle, the following steps are followed:
-1. Find underlying layer by UUID and path type information.
-2. Decode the underlying filesystem file handle to underlying dentry.
-3. For a lower file handle, lookup the handle in index directory by name.
-4. If a whiteout is found in index, return ESTALE. This represents an
- overlay object that was deleted after its file handle was encoded.
-5. For a non-directory, instantiate a disconnected overlay dentry from the
- decoded underlying dentry, the path type and index inode, if found.
-6. For a directory, use the connected underlying decoded dentry, path type
- and index, to lookup a connected overlay dentry.
+ 1. Find underlying layer by UUID and path type information.
+ 2. Decode the underlying filesystem file handle to underlying dentry.
+ 3. For a lower file handle, lookup the handle in index directory by name.
+ 4. If a whiteout is found in index, return ESTALE. This represents an
+ overlay object that was deleted after its file handle was encoded.
+ 5. For a non-directory, instantiate a disconnected overlay dentry from the
+ decoded underlying dentry, the path type and index inode, if found.
+ 6. For a directory, use the connected underlying decoded dentry, path type
+ and index, to lookup a connected overlay dentry.
Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
@@ -560,6 +708,75 @@ When the NFS export feature is enabled, all directory index entries are
verified on mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.
+Note: the mount options index=off,nfs_export=on are conflicting for a
+read-write mount and will result in an error.
+
+Note: the mount option uuid=off can be used to replace the UUID of the
+underlying filesystem in file handles with null, effectively disabling UUID
+checks. This can be useful when the underlying disk is copied and the UUID of
+the copy is changed. This is only applicable if all lower/upper/work
+directories are on the same filesystem, otherwise it will fall back to normal
+behaviour.
+
+
+UUID and fsid
+-------------
+
+The UUID of the overlayfs instance itself and the fsid reported by statfs(2)
+are controlled by the "uuid" mount option, which supports these values:
+
+- "null":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+- "off":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+ UUID of underlying layers is ignored.
+- "on":
+ UUID of overlayfs is generated and used to report a unique fsid.
+ UUID is stored in xattr "trusted.overlay.uuid", making overlayfs fsid
+ unique and persistent. This option requires an overlayfs with upper
+ filesystem that supports xattrs.
+- "auto": (default)
+ UUID is taken from xattr "trusted.overlay.uuid" if it exists.
+ Upgrade to "uuid=on" on the first mount of a new overlay filesystem that
+ meets the prerequisites.
+ Downgrade to "uuid=null" for existing overlay filesystems that were never
+ mounted with "uuid=on".
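+
+For illustration, a minimal sketch of observing the reported fsid (the mount
+point path is made up; ``__val`` is the glibc layout of ``fsid_t``)::
+
+	#include <stdio.h>
+	#include <sys/vfs.h>
+
+	int main(void)
+	{
+		struct statfs st;
+
+		if (statfs("/merged", &st) != 0)
+			return 1;
+		/* with "uuid=on" this fsid is unique and persistent */
+		printf("fsid: %x:%x\n", st.f_fsid.__val[0], st.f_fsid.__val[1]);
+		return 0;
+	}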
+
+
+Volatile mount
+--------------
+
+This is enabled with the "volatile" mount option. Volatile mounts are not
+guaranteed to survive a crash. It is strongly recommended that volatile
+mounts are only used if data written to the overlay can be recreated
+without significant effort.
+
+The advantage of mounting with the "volatile" option is that all forms of
+sync calls to the upper filesystem are omitted.
+
+In order to avoid giving a false sense of safety, the syncfs (and fsync)
+semantics of volatile mounts are slightly different from those of the rest of
+the VFS. If any writeback error occurs on the upperdir's filesystem after a
+volatile mount takes place, all sync functions will return an error. Once this
+condition is reached, the filesystem will not recover, and every subsequent sync
+call will return an error, even if the upperdir has not experienced a new error
+since the last sync call.
+
+When the overlay is mounted with the "volatile" option, the directory
+"$workdir/work/incompat/volatile" is created. During the next mount, overlay
+checks for this directory and refuses to mount if present. This is a strong
+indicator that the user should throw away the upper and work directories and
+create fresh ones. In very limited cases where the user knows that the system
+has not crashed and the contents of the upperdir are intact, the "volatile"
+directory can be removed.
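+
+For illustration, a minimal sketch of the check a careful user could perform
+before reusing the directories (the helper name is made up)::
+
+	#include <stdio.h>
+	#include <unistd.h>
+
+	/* reusable only if the volatile incompat marker is absent */
+	int upperdir_reusable(const char *workdir)
+	{
+		char path[4096];
+
+		snprintf(path, sizeof(path),
+			 "%s/work/incompat/volatile", workdir);
+		return access(path, F_OK) != 0;
+	}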
+
+
+User xattr
+----------
+
+The "-o userxattr" mount option forces overlayfs to use the
+"user.overlay." xattr namespace instead of "trusted.overlay.". This is
+useful for unprivileged mounting of overlayfs.
+
Testsuite
---------
@@ -567,9 +784,9 @@ Testsuite
There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:
- https://github.com/amir73il/unionmount-testsuite.git
+https://github.com/amir73il/unionmount-testsuite.git
-Run as root:
+Run as root::
# cd unionmount-testsuite
# ./run --ov --verify
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index f46b05e9b96c..2b2df6aa5432 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -43,15 +43,15 @@ characters, and "components" that are sequences of one or more
non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
-from some other location specified by a file descriptor given to a
-"``XXXat``" system call such as `openat() <openat_>`_.
+from some other location specified by a file descriptor given to
+"``*at()``" system calls such as `openat() <openat_>`_.
.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx``at``" system calls
+generally forbidden in POSIX, but some of those "``*at()``" system calls
in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
can execute it by calling `execveat() <execveat_>`_ passing
@@ -69,17 +69,17 @@ pathname that is just slashes have a final component. If it does
exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
-.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "``/``". According to POSIX_
+ending in "``/``". According to POSIX_:
- A pathname that contains at least one non- &lt;slash> character and
- that ends with one or more trailing &lt;slash> characters shall not
+ A pathname that contains at least one non-<slash> character and
+ that ends with one or more trailing <slash> characters shall not
be resolved successfully unless the last pathname component before
the trailing <slash> characters names an existing directory or a
directory entry that is to be created for a directory immediately
@@ -229,7 +229,7 @@ happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (``d_lookup()``) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does *not* try to prevent this
from happening, but only to detect when it happens.
``rename_lock`` is a seqlock that is updated whenever any dentry is
renamed. If ``d_lookup`` finds that a rename happened while it
@@ -376,7 +376,7 @@ table, and the mount point hash table.
Bringing it together with ``struct nameidata``
----------------------------------------------
-.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
in a ``struct nameidata``, "namei" being the traditional name - dating
@@ -398,7 +398,7 @@ held.
``struct qstr last``
~~~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ ``nul`` terminated)
+This is a string together with a length (i.e. *not* ``nul`` terminated)
that is the "next" component in the pathname.
``int last_type``
@@ -448,15 +448,17 @@ described. If it finds a ``LAST_NORM`` component it first calls
filesystem to revalidate the result if it is that sort of filesystem.
If that doesn't get a good result, it calls "``lookup_slow()``" which
takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
-to find a definitive answer. Each of these will call
-``follow_managed()`` (as described below) to handle any mount points.
-
-In the absence of symbolic links, ``walk_component()`` creates a new
-``struct path`` containing a counted reference to the new dentry and a
-reference to the new ``vfsmount`` which is only counted if it is
-different from the previous ``vfsmount``. It then calls
-``path_to_nameidata()`` to install the new ``struct path`` in the
-``struct nameidata`` and drop the unneeded references.
+to find a definitive answer.
+
+As the last step of walk_component(), step_into() will be called either
+directly from walk_component() or from handle_dots(). It calls
+handle_mounts() to check and handle mount points, creating a new
+``struct path`` containing a counted reference to the new dentry and
+a reference to the new ``vfsmount``, which is only counted if it is
+different from the previous ``vfsmount``. Then, if there is
+a symbolic link, step_into() calls pick_link() to deal with it;
+otherwise it installs the new ``struct path`` in the ``struct nameidata`` and
+drops the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
@@ -470,8 +472,8 @@ Handling the final component
``nd->last_type`` to refer to the final component of the path. It does
not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
-``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
-``path_openat()`` each of which handles the differing requirements of
+path_lookupat(), path_parentat() and
+path_openat() each of which handles the differing requirements of
different system calls.
``path_parentat()`` is clearly the simplest - it just wraps a little bit
@@ -486,20 +488,18 @@ perform their operation.
object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
calls ``walk_component()`` on the final component through a call to
``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
-
-``path_mountpoint()`` handles the special case of unmounting which must
-not try to revalidate the mounted filesystem. It effectively
-contains, through a call to ``mountpoint_last()``, an alternate
-implementation of ``lookup_slow()`` which skips that step. This is
-important when unmounting a filesystem that is inaccessible, such as
+It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set,
+path_lookupat() will unset LOOKUP_JUMPED in nameidata so that in the
+subsequent path traversal d_weak_revalidate() won't be called.
+This is important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.
Finally ``path_openat()`` is used for the ``open()`` system call; it
-contains, in support functions starting with "``do_last()``", all the
+contains, in support functions starting with "open_last_lookups()", all the
complexity needed to handle the different subtleties of O_CREAT (with
or without O_EXCL), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
-focuses on those symbolic links. "``do_last()``" will sometimes, but
+focuses on those symbolic links. "open_last_lookups()" will sometimes, but
not always, take ``i_rwsem``, depending on what it finds.
Each of these, or the functions which call them, need to be alert to
@@ -535,8 +535,7 @@ covered in greater detail in autofs.txt in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
-The Linux VFS has a concept of "managed" dentries which is reflected
-in function names such as "``follow_managed()``". There are three
+The Linux VFS has a concept of "managed" dentries. There are three
potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:
@@ -652,11 +651,11 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
This pattern of "try RCU-walk, if that fails try REF-walk" can be
-clearly seen in functions like ``filename_lookup()``,
-``filename_parentat()``, ``filename_mountpoint()``,
-``do_filp_open()``, and ``do_file_open_root()``. These five
-correspond roughly to the four ``path_``* functions we met earlier,
-each of which calls ``link_path_walk()``. The ``path_*`` functions are
+clearly seen in functions like filename_lookup(),
+filename_parentat(),
+do_filp_open(), and do_file_open_root(). These four
+correspond roughly to the three ``path_*()`` functions we met earlier,
+each of which calls ``link_path_walk()``. The ``path_*()`` functions are
called using different mode flags until a mode is found which works.
They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
that fails with the error ``ECHILD`` they are called again with no
@@ -720,7 +719,7 @@ against a dentry. The length and name pointer are copied into local
variables, then ``read_seqcount_retry()`` is called to confirm the two
are consistent, and only then is ``->d_compare()`` called. When
standard filename comparison is used, ``dentry_cmp()`` is called
-instead. Notably it does _not_ use ``read_seqcount_retry()``, but
+instead. Notably it does *not* use ``read_seqcount_retry()``, but
instead has a large comment explaining why the consistency guarantee
isn't necessary. A subsequent ``read_seqcount_retry()`` will be
sufficient to catch any problem that could occur at this point.
@@ -928,7 +927,7 @@ if anything goes wrong it is much safer to just abort and try a more
sedate approach.
The emphasis here is "try quickly and check". It should probably be
-"try quickly _and carefully,_ then check". The fact that checking is
+"try quickly *and carefully*, then check". The fact that checking is
needed is a reminder that the system is dynamic and only a limited
number of things are safe at all. The most likely cause of errors in
this whole process is assuming something is safe when in reality it
@@ -993,8 +992,8 @@ is 4096. There are a number of reasons for this limit; not letting the
kernel spend too much time on just one path is one of them. With
symbolic links you can effectively generate much longer paths so some
sort of limit is needed for the same reason. Linux imposes a limit of
-at most 40 symlinks in any one path lookup. It previously imposed a
-further limit of eight on the maximum depth of recursion, but that was
+at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed
+a further limit of eight on the maximum depth of recursion, but that was
raised to 40 when a separate stack was implemented, so there is now
just the one limit.
@@ -1061,42 +1060,26 @@ filesystem cannot successfully get a reference in RCU-walk mode, it
must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
REF-walk mode in which the filesystem is allowed to sleep.
-The place for all this to happen is the ``i_op->follow_link()`` inode
-method. In the present mainline code this is never actually called in
-RCU-walk mode as the rewrite is not quite complete. It is likely that
-in a future release this method will be passed an ``inode`` pointer when
-called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
-validated pointer. Much like the ``i_op->permission()`` method we
-looked at previously, ``->follow_link()`` would need to be careful that
+The place for all this to happen is the ``i_op->get_link()`` inode
+method. This is called both in RCU-walk and REF-walk. In RCU-walk the
+``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
+RCU-walk. Much like the ``i_op->permission()`` method we
+looked at previously, ``->get_link()`` would need to be careful that
all the data structures it references are safe to be accessed while
-holding no counted reference, only the RCU lock. Though getting a
-reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
-code is ready to release the reference when that does happen.
-
-This need to drop the reference to a symlink adds significant
-complexity. It requires a reference to the inode so that the
-``i_op->put_link()`` inode operation can be called. In REF-walk, that
-reference is kept implicitly through a reference to the dentry, so
-keeping the ``struct path`` of the symlink is easiest. For RCU-walk,
-the pointer to the inode is kept separately. To allow switching from
-RCU-walk back to REF-walk in the middle of processing nested symlinks
-we also need the seq number for the dentry so we can confirm that
-switching back was safe.
-
-Finally, when providing a reference to a symlink, the filesystem also
-provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
-knows what to free. This might be the allocated memory area, or a
-pointer to the ``struct page`` in the page cache, or something else
-completely. Only the filesystem knows what it is.
+holding no counted reference, only the RCU lock. A callback
+``struct delayed_call`` will be passed to ``->get_link()``:
+filesystems can set their own put_link function and argument through
+set_delayed_call(). Later on, when the VFS wants to put the link, it will call
+do_delayed_call() to invoke that callback function with the argument.
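+
+As a rough sketch (not actual kernel code; foo_read_link() is a made-up
+helper), a ``->get_link()`` instance following this pattern might look like::
+
+	static const char *foo_get_link(struct dentry *dentry,
+					struct inode *inode,
+					struct delayed_call *done)
+	{
+		char *link;
+
+		if (!dentry)	/* RCU-walk: we may not sleep here */
+			return ERR_PTR(-ECHILD);
+		link = kmalloc(PATH_MAX, GFP_KERNEL);
+		if (!link)
+			return ERR_PTR(-ENOMEM);
+		foo_read_link(inode, link);
+		/* kfree_link() will be invoked by do_delayed_call() */
+		set_delayed_call(done, kfree_link, link);
+		return link;
+	}
+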
In order for the reference to each symlink to be dropped when the walk completes,
whether in RCU-walk or REF-walk, the symlink stack needs to contain,
along with the path remnants:
-- the ``struct path`` to provide a reference to the inode in REF-walk
-- the ``struct inode *`` to provide a reference to the inode in RCU-walk
+- the ``struct path`` to provide a reference to the previous path
+- the ``const char *`` to provide a reference to the previous name
- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
-- the ``cookie`` that tells ``->put_path()`` what to put.
+- the ``struct delayed_call`` for later invocation.
This means that each entry in the symlink stack needs to hold five
pointers and an integer instead of just one pointer (the path
@@ -1120,12 +1103,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the
stack is very straightforward; pushing and popping the references is
a little more complex.
-When a symlink is found, ``walk_component()`` returns the value ``1``
-(``0`` is returned for any other sort of success, and a negative number
-is, as usual, an error indicator). This causes ``get_link()`` to be
-called; it then gets the link from the filesystem. Providing that
-operation is successful, the old path ``name`` is placed on the stack,
-and the new value is used as the ``name`` for a while. When the end of
+When a symlink is found, walk_component() calls pick_link() via step_into()
+which returns the link from the filesystem.
+Providing that operation is successful, the old path ``name`` is placed on the
+stack, and the new value is used as the ``name`` for a while. When the end of
the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
off the stack and path walking continues.
@@ -1142,23 +1123,23 @@ stack in ``walk_component()`` immediately when the symlink is found;
old symlink as it walks that last component. So it is quite
convenient for ``walk_component()`` to release the old symlink and pop
the references just before pushing the reference information for the
-new symlink. It is guided in this by two flags; ``WALK_GET``, which
-gives it permission to follow a symlink if it finds one, and
-``WALK_PUT``, which tells it to release the current symlink after it has been
-followed. ``WALK_PUT`` is tested first, leading to a call to
-``put_link()``. ``WALK_GET`` is tested subsequently (by
-``should_follow_link()``) leading to a call to ``pick_link()`` which sets
-up the stack frame.
+new symlink. It is guided in this by three flags: ``WALK_NOFOLLOW`` which
+forbids it from following a symlink if it finds one, ``WALK_MORE``
+which indicates that it is yet too early to release the
+current symlink, and ``WALK_TRAILING`` which indicates that it is on the final
+component of the lookup, so the userspace flag ``LOOKUP_FOLLOW`` will be
+checked to decide whether to follow a symlink, and ``may_follow_link()`` will
+be called to check that we have the privilege to follow it.
Symlinks with no final component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A pair of special-case symlinks deserve a little further explanation.
Both result in a new ``struct path`` (with mount and dentry) being set
-up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
+up in the ``nameidata``, and result in pick_link() returning ``NULL``.
The more obvious case is a symlink to "``/``". All symlinks starting
-with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
+with "``/``" are detected in pick_link() which resets the ``nameidata``
to point to the effective filesystem root. If the symlink only
contains "``/``" then there is nothing more to do, no components at all,
so ``NULL`` is returned to indicate that the symlink can be released and
@@ -1175,12 +1156,11 @@ something that looks like a symlink. It is really a reference to the
target file, not just the name of it. When you ``readlink`` these
objects you get a name that might refer to the same file - unless it
has been unlinked or mounted over. When ``walk_component()`` follows
-one of these, the ``->follow_link()`` method in "procfs" doesn't return
-a string name, but instead calls ``nd_jump_link()`` which updates the
-``nameidata`` in place to point to that target. ``->follow_link()`` then
-returns ``NULL``. Again there is no final component and ``get_link()``
-reports this by leaving the ``last_type`` field of ``nameidata`` as
-``LAST_BIND``.
+one of these, the ``->get_link()`` method in "procfs" doesn't return
+a string name, but instead calls nd_jump_link() which updates the
+``nameidata`` in place to point to that target. ``->get_link()`` then
+returns ``NULL``. Again there is no final component and pick_link()
+returns ``NULL``.
Following the symlink in the final component
--------------------------------------------
@@ -1197,42 +1177,38 @@ potentially need to call ``link_path_walk()`` again and again on
successive symlinks until one is found that doesn't point to another
symlink.
-This case is handled by the relevant caller of ``link_path_walk()``, such as
-``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
-handles the final component. If the final component is a symlink
-that needs to be followed, then ``trailing_symlink()`` is called to set
-things up properly and the loop repeats, calling ``link_path_walk()``
-again. This could loop as many as 40 times if the last component of
-each symlink is another symlink.
-
-The various functions that examine the final component and possibly
-report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
-and ``do_last()``, each of which use the same convention as
-``walk_component()`` of returning ``1`` if a symlink was found that needs
-to be followed.
-
-Of these, ``do_last()`` is the most interesting as it is used for
-opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this
-part is in a separate function: ``lookup_open()``.
-
-Explaining ``do_last()`` completely is beyond the scope of this article,
-but a few highlights should help those interested in exploring the
-code.
-
-1. Rather than just finding the target file, ``do_last()`` needs to open
+This case is handled by the relevant callers of link_path_walk(), such as
+path_lookupat() and path_openat(), using a loop that calls link_path_walk(),
+and then handles the final component by calling open_last_lookups() or
+lookup_last(). If it is a symlink that needs to be followed,
+open_last_lookups() or lookup_last() will set things up properly and
+return the path so that the loop repeats, calling
+link_path_walk() again. This could loop as many as 40 times if the last
+component of each symlink is another symlink.
+
+Of the various functions that examine the final component,
+open_last_lookups() is the most interesting as it works in tandem
+with do_open() for opening a file. Part of open_last_lookups() runs
+with ``i_rwsem`` held and this part is in a separate function: lookup_open().
+
+Explaining open_last_lookups() and do_open() completely is beyond the scope
+of this article, but a few highlights should help those interested in exploring
+the code.
+
+1. Rather than just finding the target file, do_open() is used after
+ open_last_lookups() to open
it. If the file was found in the dcache, then ``vfs_open()`` is used for
this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
the filesystem provides it) to combine the final lookup with the open, or
- will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
+ will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
directly. In the latter case the actual "open" of this newly found or
- created file will be performed by ``vfs_open()``, just as if the name
+ created file will be performed by vfs_open(), just as if the name
were found in the dcache.
-2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
- wasn't quite current enough. Rather than restarting the lookup from
- the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
- giving the filesystem a chance to resolve small inconsistencies.
- If that doesn't work, only then is the lookup restarted from the top.
+2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
+ wasn't quite current enough. In RCU-walk ``-ECHILD`` will be returned;
+ otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller
+ may retry with the ``LOOKUP_REVAL`` flag set.
3. An open with O_CREAT **does** follow a symlink in the final component,
unlike other creation system calls (like ``mkdir``). So the sequence::
@@ -1242,8 +1218,8 @@ code.
will create a file called ``/tmp/bar``. This is not permitted if
``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
- like for a non-creating open: ``should_follow_link()`` returns ``1``, and
- so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
+ like for a non-creating open: lookup_last() or open_last_lookups()
+ returns a non-``NULL`` value, and link_path_walk() gets called and the
open process continues on the symlink that was found.
Updating the access time
@@ -1265,7 +1241,7 @@ Symlinks are different it seems. Both reading a symlink (with ``readlink()``)
and looking up a symlink on the way to some other destination can
update the atime on that symlink.
-.. _clearest statement: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
+.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
It is not clear why this is the case; POSIX has little to say on the
subject. The `clearest statement`_ is that, if a particular implementation
@@ -1321,18 +1297,18 @@ to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
yet. This is primarily used to tell the audit subsystem the full
context of a particular access being audited.
-``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was
+``ND_ROOT_PRESET`` indicates that the ``root`` field in the ``nameidata`` was
provided by the caller, so it shouldn't be released when it is no
longer needed.
-``LOOKUP_JUMPED`` means that the current dentry was chosen not because
+``ND_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
link"). In this case the filesystem has not been asked to revalidate the
name (with ``d_revalidate()``). In such cases the inode may still need
to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
-``LOOKUP_JUMPED`` is set when the look completes - which may be at the
+``ND_JUMPED`` is set when the lookup completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
Resolution-restriction flags
@@ -1365,7 +1341,7 @@ as well as blocking ".." if it would jump outside the starting point.
resolution of "..". Magic-links are also blocked.
``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
-were the filesystem root. ``nd_jump_root()`` brings the resolution back to to
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to
the starting point, and ".." at the starting point will act as a no-op. As with
``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
attacks against ".." resolution. Magic-links are also blocked.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 9b8930f589d9..1aa7ce099f6f 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -375,7 +375,7 @@ common path elements, the more likely they will exist in dentry cache.
Papers and other documentation on dcache locking
================================================
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124).
2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 26c093969573..1be76ef117b3 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -45,6 +45,12 @@ typically between calling iget_locked() and unlocking the inode.
At some point that will become mandatory.
+**mandatory**
+
+The foo_inode_info should always be allocated through alloc_inode_sb() rather
+than kmem_cache_alloc() or kmalloc(), so that the inode reclaim context is
+set up correctly.
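+
+A minimal sketch, with illustrative names::
+
+	static struct inode *foo_alloc_inode(struct super_block *sb)
+	{
+		struct foo_inode_info *fi;
+
+		fi = alloc_inode_sb(sb, foo_inode_cachep, GFP_KERNEL);
+		if (!fi)
+			return NULL;
+		return &fi->vfs_inode;
+	}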
+
---
**mandatory**
@@ -171,7 +177,7 @@ settles down a bit.
**mandatory**
s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, resierfs, fat
+isofs, ext2, ext3, reiserfs, fat
can be used as examples of very different filesystems.
---
@@ -456,15 +462,15 @@ ERR_PTR(...).
argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask.
generic_permission() has also lost the check_acl argument; ACL checking
-has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_acl
-to read an ACL from disk.
+has been taken to VFS and filesystems need to provide a non-NULL
+->i_op->get_inode_acl to read an ACL from disk.
---
**mandatory**
If you implement your own ->llseek() you must handle SEEK_HOLE and
-SEEK_DATA. You can hanle this by returning -EINVAL, but it would be nicer to
+SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to
support it in some way. The generic handler assumes that the entire file is
data and there is a virtual hole at the end of the file. So if the provided
offset is less than i_size and SEEK_DATA is specified, return the same offset.
@@ -511,7 +517,7 @@ The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and
->create() doesn't take ``struct nameidata *``; unlike the previous
two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that
-local filesystems can ignore tha argument - they are guaranteed that the
+local filesystems can ignore this argument - they are guaranteed that the
object doesn't exist. It's remote/distributed ones that might care...
---
@@ -531,7 +537,7 @@ vfs_readdir() is gone; switch to iterate_dir() instead
**mandatory**
-->readdir() is gone now; switch to ->iterate()
+->readdir() is gone now; switch to ->iterate_shared()
**mandatory**
@@ -618,7 +624,7 @@ any symlink that might use page_follow_link_light/page_put_link() must
have inode_nohighmem(inode) called before anything might start playing with
its pagecache. No highmem pages should end up in the pagecache of such
symlinks. That includes any preseeding that might be done during symlink
-creation. __page_symlink() will honour the mapping gfp flags, so once
+creation. page_symlink() will honour the mapping gfp flags, so once
you've done inode_nohighmem() it's safe to use, but if you allocate and
insert the page manually, make sure to use the right gfp flags.
@@ -687,24 +693,19 @@ parallel now.
---
-**recommended**
+**mandatory**
-->iterate_shared() is added; it's a parallel variant of ->iterate().
+->iterate_shared() is added.
Exclusion on struct file level is still provided (as well as that
between it and lseek on the same struct file), but if your directory
has been opened several times, you can get these called in parallel.
Exclusion between that method and all directory-modifying ones is
still provided, of course.
-Often enough ->iterate() can serve as ->iterate_shared() without any
-changes - it is a read-only operation, after all. If you have any
-per-inode or per-dentry in-core data structures modified by ->iterate(),
-you might need something to serialize the access to them. If you
-do dcache pre-seeding, you'll need to switch to d_alloc_parallel() for
-that; look for in-tree examples.
-
-Old method is only used if the new one is absent; eventually it will
-be removed. Switch while you still can; the old one won't stay.
+If you have any per-inode or per-dentry in-core data structures modified
+by ->iterate_shared(), you might need something to serialize the access
+to them. If you do dcache pre-seeding, you'll need to switch to
+d_alloc_parallel() for that; look for in-tree examples.
---
@@ -717,6 +718,8 @@ be removed. Switch while you still can; the old one won't stay.
**mandatory**
->setxattr() and xattr_handler.set() get dentry and inode passed separately.
+The xattr_handler.set() gets passed the user namespace of the mount the inode
+is seen from, so filesystems can idmap the i_uid and i_gid accordingly.
dentry might be yet to be attached to inode, so do _not_ use its ->d_inode
in the instances. Rationale: !@#!@# security_d_instantiate() needs to be
called before we attach dentry to inode and !@#!@##!@$!$#!@#$!@$!@$ smack
@@ -858,3 +861,276 @@ be misspelled d_alloc_anon().
[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
+
+---
+
+**mandatory**
+
+clone_private_mount() returns a longterm mount now, so the proper destructor of
+its result is kern_unmount() or kern_unmount_array().
+
+---
+
+**mandatory**
+
+zero-length bvec segments are disallowed; they must be filtered out before
+being passed on to an iterator.
+
+---
+
+**mandatory**
+
+For bvec based iterators bio_iov_iter_get_pages() now doesn't copy bvecs but
+uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec
+and page references stay until I/O has completed, i.e. until ->ki_complete()
+has been called or the operation has returned with a code other than
+-EIOCBQUEUED.
+
+---
+
+**mandatory**
+
+mnt_want_write_file() can now only be paired with mnt_drop_write_file(),
+whereas previously it could be paired with mnt_drop_write() as well.
+
+---
+
+**mandatory**
+
+iov_iter_copy_from_user_atomic() is gone; use copy_page_from_iter_atomic().
+The difference is copy_page_from_iter_atomic() advances the iterator and
+you don't need iov_iter_advance() after it. However, if you decide to use
+only a part of obtained data, you should do iov_iter_revert().
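+
+A rough sketch of the new pattern (variable names are illustrative)::
+
+	copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+	/* the iterator has already been advanced by 'copied' bytes */
+	if (used < copied)	/* we consumed only 'used' of them */
+		iov_iter_revert(i, copied - used);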
+
+---
+
+**mandatory**
+
+Calling conventions for file_open_root() changed; now it takes struct path *
+instead of passing mount and dentry separately. For callers that used to
+pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper
+is provided - file_open_root_mnt(). In-tree users adjusted.
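+
+A minimal sketch of such a conversion::
+
+	/* for callers that passed the root of 'mnt' as the starting point */
+	file = file_open_root_mnt(mnt, name, flags, mode);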
+
+---
+
+**mandatory**
+
+no_llseek is gone; don't set .llseek to that - just leave it NULL instead.
+Checks for "does that file have llseek(2), or should it fail with ESPIPE"
+should be done by looking at FMODE_LSEEK in file->f_mode.
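+
+A minimal sketch of that check::
+
+	if (!(file->f_mode & FMODE_LSEEK))
+		return -ESPIPE;	/* no llseek(2) on this file */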
+
+---
+
+**mandatory**
+
+filldir_t (readdir callbacks) calling conventions have changed. Instead of
+returning 0 or -E... it returns bool now. false means "no more" (as -E... used
+to) and true - "keep going" (as 0 in old calling conventions). Rationale:
+callers never looked at specific -E... values anyway. ->iterate_shared()
+instances require no changes at all; all in-tree filldir_t instances have
+been converted.
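+
+A rough sketch of a converted instance (foo_emit() is a made-up helper)::
+
+	static bool foo_filldir(struct dir_context *ctx, const char *name,
+				int namlen, loff_t offset, u64 ino,
+				unsigned int d_type)
+	{
+		if (!foo_emit(ctx, name, namlen, offset, ino, d_type))
+			return false;	/* "no more", as -E... used to be */
+		return true;		/* "keep going", as 0 used to be */
+	}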
+
+---
+
+**mandatory**
+
+Calling conventions for ->tmpfile() have changed. It now takes a struct
+file pointer instead of struct dentry pointer. d_tmpfile() is similarly
+changed to simplify callers. The passed file is in a non-open state and on
+success must be opened before returning (e.g. by calling
+finish_open_simple()).
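+
+A rough sketch of a converted instance, using the current signature
+(foo_new_inode() is a made-up helper)::
+
+	static int foo_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
+			       struct file *file, umode_t mode)
+	{
+		struct inode *inode = foo_new_inode(dir, mode);
+
+		if (IS_ERR(inode))
+			return PTR_ERR(inode);
+		d_tmpfile(file, inode);
+		return finish_open_simple(file, 0);
+	}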
+
+---
+
+**mandatory**
+
+Calling convention for ->huge_fault has changed. It now takes a page
+order instead of an enum page_entry_size, and it may be called without the
+mmap_lock held. All in-tree users have been audited and do not seem to
+depend on the mmap_lock being held, but out of tree users should verify
+for themselves. If they do need it, they can return VM_FAULT_RETRY to
+be called with the mmap_lock held.
+
+---
+
+**mandatory**
+
+The order of opening block devices and matching or creating superblocks has
+changed.
+
+The old logic opened block devices first and then tried to find a
+suitable superblock to reuse based on the block device pointer.
+
+The new logic tries to find a suitable superblock first based on the device
+number, and opens the block device afterwards.
+
+Since opening block devices cannot happen under s_umount because of lock
+ordering requirements, s_umount is now dropped while opening block devices and
+reacquired before calling fill_super().
+
+In the old logic concurrent mounters would find the superblock on the list of
+superblocks for the filesystem type. Since the first opener of the block device
+would hold s_umount they would wait until the superblock became either born or
+was discarded due to initialization failure.
+
+Since the new logic drops s_umount, concurrent mounters could grab s_umount and
+would spin. Instead they are now made to wait using an explicit wait-wake
+mechanism without having to hold s_umount.
+
+---
+
+**mandatory**
+
+The holder of a block device is now the superblock.
+
+The holder of a block device used to be the file_system_type which wasn't
+particularly useful. It wasn't possible to go from block device to owning
+superblock without matching on the device pointer stored in the superblock.
+This mechanism would only work for a single device so the block layer couldn't
+find the owning superblock of any additional devices.
+
+In the old mechanism reusing or creating a superblock for a racing mount(2) and
+umount(2) relied on the file_system_type as the holder. This was severely
+underdocumented, however:
+
+(1) Any concurrent mounter that managed to grab an active reference on an
+ existing superblock was made to wait until the superblock either became
+ ready or until the superblock was removed from the list of superblocks of
+ the filesystem type. If the superblock is ready the caller would simply
+ reuse it.
+
+(2) If the mounter came after deactivate_locked_super() but before
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would wait until the superblock was shutdown,
+ reuse the block device and allocate a new superblock.
+
+(3) If the mounter came after deactivate_locked_super() and after
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would reuse the block device and allocate a new
+ superblock (the bd_holder pointer may still be set to the filesystem type).
+
+Because the holder of the block device was the file_system_type any concurrent
+mounter could open the block devices of any superblock of the same
+file_system_type without risking seeing EBUSY because the block device was
+still in use by another superblock.
+
+Making the superblock the owner of the block device changes this as the holder
+is now a unique superblock and thus block devices associated with it cannot be
+reused by concurrent mounters. So a concurrent mounter in (2) could suddenly
+see EBUSY when trying to open a block device whose holder was a different
+superblock.
+
+The new logic thus waits until the superblock and the devices are shutdown in
+->kill_sb(). Removal of the superblock from the list of superblocks of the
+filesystem type is now moved to a later point when the devices are closed:
+
+(1) Any concurrent mounter managing to grab an active reference on an existing
+ superblock is made to wait until the superblock is either ready or until
+ the superblock and all devices are shutdown in ->kill_sb(). If the
+ superblock is ready the caller will simply reuse it.
+
+(2) If the mounter comes after deactivate_locked_super() but before
+ the superblock has been removed from the list of superblocks of the
+ filesystem type the mounter is made to wait until the superblock and the
+ devices are shut down in ->kill_sb() and the superblock is removed from the
+ list of superblocks of the filesystem type. The mounter will allocate a new
+ superblock and grab ownership of the block device (the bd_holder pointer of
+ the block device will be set to the newly allocated superblock).
+
+(3) This case is now collapsed into (2) as the superblock is left on the list
+ of superblocks of the filesystem type until all devices are shutdown in
+ ->kill_sb(). In other words, if the superblock isn't on the list of
+ superblocks of the filesystem type anymore then it has given up ownership of
+ all associated block devices (the bd_holder pointer is NULL).
+
+As this is a VFS level change it has no practical consequences for filesystems
+other than that all of them must use one of the provided kill_litter_super(),
+kill_anon_super(), or kill_block_super() helpers.
+
+---
+
+**mandatory**
+
+Lock ordering has been changed so that s_umount ranks above open_mutex again.
+All places where s_umount was taken under open_mutex have been fixed up.
+
+---
+
+**mandatory**
+
+export_operations ->encode_fh() no longer has a default implementation to
+encode FILEID_INO32_GEN* file handles.
+Filesystems that used the default implementation may use the generic helper
+generic_encode_ino32_fh() explicitly.
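+
+A minimal sketch (the fh_to_* methods are illustrative)::
+
+	static const struct export_operations foo_export_ops = {
+		.encode_fh	= generic_encode_ino32_fh,
+		.fh_to_dentry	= foo_fh_to_dentry,
+		.fh_to_parent	= foo_fh_to_parent,
+	};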
+
+---
+
+**mandatory**
+
+If ->rename() update of .. on cross-directory move needs an exclusion with
+directory modifications, do *not* lock the subdirectory in question in your
+->rename() - it's done by the caller now [that item should've been added in
+28eceeda130f "fs: Lock moved directories"].
+
+---
+
+**mandatory**
+
+On same-directory ->rename() the (tautological) update of .. is not protected
+by any locks; just don't do it if the old parent is the same as the new one.
+We really can't lock two subdirectories in same-directory rename - not without
+deadlocks.
+
+---
+
+**mandatory**
+
+lock_rename() and lock_rename_child() may fail in cross-directory case, if
+their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV)
+is returned, with no locks taken. In-tree users updated; out-of-tree ones
+would need to do so.
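+
+A rough sketch of the required check::
+
+	struct dentry *trap;
+
+	trap = lock_rename(d1, d2);
+	if (IS_ERR(trap))
+		return PTR_ERR(trap);	/* -EXDEV: no common ancestor */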
+
+---
+
+**mandatory**
+
+The list of children anchored in parent dentry got turned into hlist now.
+Field names got changed (->d_children/->d_sib instead of ->d_subdirs/->d_child
+for anchor/entries resp.), so any affected places will be immediately caught
+by compiler.
+
+---
+
+**mandatory**
+
+->d_delete() instances are now called for dentries with ->d_lock held
+and refcount equal to 0. They are not permitted to drop/regain ->d_lock.
+None of in-tree instances did anything of that sort. Make sure yours do not...
+
+---
+
+**mandatory**
+
+->d_prune() instances are now called without ->d_lock held on the parent.
+->d_lock on dentry itself is still held; if you need per-parent exclusions (none
+of the in-tree instances did), use your own spinlock.
+
+->d_iput() and ->d_release() are called with victim dentry still in the
+list of parent's children. It is still unhashed, marked killed, etc., just not
+removed from parent's ->d_children yet.
+
+Anyone iterating through the list of children needs to be aware of the
+half-killed dentries that might be seen there; taking ->d_lock on those will
+see them negative, unhashed and with negative refcount, which means that most
+of the in-kernel users would've done the right thing anyway without any adjustment.
+
+---
+
+**recommended**
+
+Block device freezing and thawing have been moved to holder operations.
+
+Before this change, get_active_super() would only be able to find the
+superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
+device freezing now works for any block device owned by a given superblock, not
+just the main block device. The get_active_super() helper and bd_fsfreeze_sb
+pointer are gone.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 38b606991065..104c6d047d9b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,10 +47,13 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
+ 3.13 /proc/<pid>/fd - List of symlinks to open files
4 Configuring procfs
4.1 Mount options
+ 5 Filesystem behavior
+
Preface
=======
@@ -82,7 +85,7 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
document.
The latest version of this document is available online at
-http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
+https://www.kernel.org/doc/html/latest/filesystems/proc.html
If the above direction does not work for you, you could try the kernel
mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
@@ -121,10 +124,10 @@ show you how you can use /proc/sys to change settings.
The directory /proc contains (among other things) one subdirectory for each
process running on the system, which is named after the process ID (PID).
-The link self points to the process reading the file system. Each process
+The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
-Note that an open a file descriptor to /proc/<pid> or to any of its
+Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
open /proc/<pid> file descriptors corresponding to dead processes
@@ -176,6 +179,7 @@ read the file /proc/PID/status::
Gid: 100 100 100 100
FDSize: 256
Groups: 100 14 16
+ Kthread: 0
VmPeak: 5004 kB
VmSize: 5004 kB
VmLck: 0 kB
@@ -208,6 +212,7 @@ read the file /proc/PID/status::
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
+ SpeculationIndirectBranch: conditional enabled
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 1
file /proc/PID/status. Its fields are described in table 1-2.
The statm file contains more detailed information about the process
memory usage. Its seven fields are explained in Table 1-3. The stat file
-contains details information about the process itself. Its fields are
+contains detailed information about the process itself. Its fields are
explained in Table 1-4.
(for SMP CONFIG users)
@@ -228,7 +233,7 @@ asynchronous manner and the value may not be very precise. To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
It's slow but very precise.
-.. table:: Table 1-2: Contents of the status files (as of 4.19)
+.. table:: Table 1-2: Contents of the status fields (as of 4.19)
========================== ===================================================
Field Content
@@ -242,7 +247,8 @@ It's slow but very precise.
Ngid NUMA group ID (0 if none)
Pid process id
PPid process id of the parent process
- TracerPid PID of process tracing this process (0 if not)
+ TracerPid PID of process tracing this process (0 if not, or
+ the tracer is outside of the current pid namespace)
Uid Real, effective, saved set, and file system UIDs
Gid Real, effective, saved set, and file system GIDs
FDSize number of file descriptor slots currently allocated
@@ -251,6 +257,7 @@ It's slow but very precise.
NSpid descendant namespace process ID hierarchy
NSpgid descendant namespace process group ID hierarchy
NSsid descendant namespace session ID hierarchy
+ Kthread kernel thread flag, 1 is yes, 0 is no
VmPeak peak virtual memory size
VmSize total program size
VmLck locked memory size
@@ -290,6 +297,7 @@ It's slow but very precise.
NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
Speculation_Store_Bypass speculative store bypass mitigation status
+ SpeculationIndirectBranch indirect branch speculation mode
Cpus_allowed mask of CPUs on which this process may run
Cpus_allowed_list Same as previous, but in "list format"
Mems_allowed mask of memory nodes allowed to this process
@@ -299,7 +307,7 @@ It's slow but very precise.
========================== ===================================================
-.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
+.. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
======== =============================== ==============================
Field Content
@@ -317,7 +325,7 @@ It's slow but very precise.
======== =============================== ==============================
-.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+.. table:: Table 1-4: Contents of the stat fields (as of 2.6.30-rc7)
============= ===============================================================
Field Content
@@ -422,12 +430,16 @@ with the memory region, as the case would be with BSS (uninitialized data).
The "pathname" shows the name associated file for this mapping. If the mapping
is not associated with a file:
- ======= ====================================
+ =================== ===========================================
[heap] the heap of the program
[stack] the stack of the main process
[vdso] the "virtual dynamic shared object",
the kernel system call handler
- ======= ====================================
+ [anon:<name>] a private anonymous mapping that has been
+ named by userspace
+ [anon_shmem:<name>] an anonymous shared memory mapping that has
+ been named by userspace
+ =================== ===========================================
or if empty, the mapping is anonymous.
@@ -442,12 +454,14 @@ Memory Area, or VMA) there is a series of lines such as the following::
MMUPageSize: 4 kB
Rss: 892 kB
Pss: 374 kB
+ Pss_Dirty: 0 kB
Shared_Clean: 892 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+ KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
@@ -473,7 +487,9 @@ dirty shared and private pages in the mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other
-process, its PSS will be 1500.
+process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
+consists of dirty pages. ("Pss_Clean" is not included, but it can be
+calculated by subtracting "Pss_Dirty" from "Pss".)
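+
+For example, the process-wide totals can be derived with a quick scan of
+smaps (a sketch; pid 1234 is illustrative, and all values are in kB)::
+
+    > awk '/^Pss:/ {pss += $2} /^Pss_Dirty:/ {d += $2} END {print pss - d}' /proc/1234/smaps
+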
Note that even a page which is part of a MAP_SHARED mapping, but has only
a single pte mapped, i.e. is currently used by only one process, is accounted
@@ -486,18 +502,21 @@ accessed.
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
+"KSM" reports how many of the pages are KSM pages. Note that KSM-placed zeropages
+are not included, only actual KSM pages.
+
"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(). It's freed in memory
pressure if the memory is clean. Please note that the printed value might
be lower than the real value due to optimizations used in the current
implementation. If this is not desirable please file a bug report.
-"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"AnonHugePages" shows the amount of memory backed by transparent hugepage.
-"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
+"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
huge pages.
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
@@ -508,8 +527,10 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages - 1 if true, 0 otherwise. It just shows the current status.
+
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled size. 1 if true, 0
+otherwise.
"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
@@ -536,13 +557,20 @@ encoded manner. The codes are the following:
ac area is accountable
nr swap space is not reserved for the area
ht area uses huge tlb pages
+ sf synchronous page fault
ar architecture specific flag
+ wf wipe on fork
dd do not include area into core dump
sd soft dirty flag
mm mixed map area
hg huge page advise flag
nh no huge page advise flag
- mg mergable advise flag
+ mg mergeable advise flag
+ bt arm64 BTI guarded page
+ mt arm64 MTE allocation tags are enabled
+ um userfaultfd missing tracking
+ uw userfaultfd wr-protect tracking
+ ss shadow stack page
== =======================================
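
For instance, the flag combinations in use by the mappings of the current
shell can be listed directly (a sketch)::

    > grep ^VmFlags /proc/$$/smaps | sort | uniq -c
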
Note that there is no guarantee that every flag and associated mnemonic will
@@ -661,9 +689,15 @@ files are there, and which are missing.
File Content
============ ===============================================================
apm Advanced power management info
+ bootconfig Kernel command line obtained from boot config,
+ and, if there were kernel parameters from the
+ boot loader, a "# Parameters from bootloader:"
+ line followed by a line containing those
+ parameters prefixed by "# ". (5.5)
buddyinfo Kernel memory allocator information (see text) (2.5)
bus Directory containing bus specific information
- cmdline Kernel command line
+ cmdline Kernel command line, both from bootloader and embedded
+ in the kernel image
cpuinfo Info about the CPU
devices Available devices (block and character)
dma Used DMA channels
@@ -681,7 +715,14 @@ files are there, and which are missing.
kcore Kernel core image (can be ELF or A.OUT (deprecated in 2.4))
kmsg Kernel messages
ksyms Kernel symbol table
- loadavg Load average of last 1, 5 & 15 minutes
+ loadavg Load average of last 1, 5 & 15 minutes;
+ number of processes currently runnable (running or on ready queue);
+ total number of processes in system;
+ last pid created.
+ All fields are separated by one space except "number of
+ processes currently runnable" and "total number of processes
+ in system", which are separated by a slash ('/'). Example:
+ 0.61 0.61 0.55 3/828 22084
locks Kernel locks
meminfo Memory info
misc Miscellaneous
@@ -779,7 +820,7 @@ SPU
For this case the APIC will generate the interrupt with a IRQ vector
of 0xff. This might also be generated by chipset bugs.
-RES, CAL, TLB]
+RES, CAL, TLB
rescheduling, call and TLB flush interrupts are
sent from one CPU to another per the needs of the OS. Typically,
their statistics are used by kernel developers and interested users to
@@ -791,7 +832,7 @@ suppressed when the system is a uniprocessor. As of this writing, only
i386 and x86_64 platforms support the new IRQ vector displays.
Of some interest is the introduction of the /proc/irq directory to 2.4.
-It could be used to set IRQ to CPU affinity, this means that you can "hook" an
+It could be used to set IRQ to CPU affinity. This means that you can "hook" an
IRQ to only one CPU, or to exclude a CPU from handling IRQs. The contents of
the irq subdir are one subdir for each IRQ, and two files: default_smp_affinity
and prof_cpu_mask.
@@ -805,7 +846,7 @@ For example::
smp_affinity
smp_affinity is a bitmask, in which you can specify which CPUs can handle the
-IRQ, you can set it by doing::
+IRQ. You can set it by doing::
> echo 1 > /proc/irq/10/smp_affinity
@@ -818,7 +859,7 @@ The contents of each smp_affinity file is the same by default::
ffffffff
There is an alternate interface, smp_affinity_list which allows specifying
-a cpu range instead of a bitmask::
+a CPU range instead of a bitmask::
> cat /proc/irq/0/smp_affinity_list
1024-1031
@@ -832,7 +873,7 @@ reports itself as being attached. This hardware locality information does not
include information about any possible driver locality preference.
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
-profiler. Default value is ffffffff (all cpus if there are only 32 of them).
+profiler. Default value is ffffffff (all CPUs if there are only 32 of them).
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
between all the CPUs which are allowed to handle it. As usual the kernel has
@@ -894,7 +935,7 @@ pagetypeinfo::
Fragmentation avoidance in the kernel works by grouping pages of different
migrate types into the same contiguous regions of memory called page blocks.
-A page block is typically the size of the default hugepage size e.g. 2MB on
+A page block is typically the size of the default hugepage size, e.g. 2MB on
X86-64. By keeping pages grouped based on their ability to move, the kernel
can reclaim pages within a page block to satisfy a high-order allocation.
@@ -916,56 +957,82 @@ meminfo
~~~~~~~
Provides information about distribution and utilization of memory. This
-varies by architecture and compile options. The following is from a
-16GB PIII, which has highmem enabled. You may not have all of these fields.
+varies by architecture and compile options. Some of the counters reported
+here overlap. The memory reported by the non-overlapping counters may not
+add up to the overall memory usage, and the difference can be substantial
+for some workloads. In many cases there are other means to find out
+additional memory usage via subsystem-specific interfaces, for instance
+/proc/net/sockstat for TCP memory allocations.
+
+Example output. You may not have all of these fields.
::
> cat /proc/meminfo
- MemTotal: 16344972 kB
- MemFree: 13634064 kB
- MemAvailable: 14836172 kB
- Buffers: 3656 kB
- Cached: 1195708 kB
- SwapCached: 0 kB
- Active: 891636 kB
- Inactive: 1077224 kB
- HighTotal: 15597528 kB
- HighFree: 13629632 kB
- LowTotal: 747444 kB
- LowFree: 4432 kB
- SwapTotal: 0 kB
- SwapFree: 0 kB
- Dirty: 968 kB
- Writeback: 0 kB
- AnonPages: 861800 kB
- Mapped: 280372 kB
- Shmem: 644 kB
- KReclaimable: 168048 kB
- Slab: 284364 kB
- SReclaimable: 159856 kB
- SUnreclaim: 124508 kB
- PageTables: 24448 kB
- NFS_Unstable: 0 kB
- Bounce: 0 kB
- WritebackTmp: 0 kB
- CommitLimit: 7669796 kB
- Committed_AS: 100056 kB
- VmallocTotal: 112216 kB
- VmallocUsed: 428 kB
- VmallocChunk: 111088 kB
- Percpu: 62080 kB
- HardwareCorrupted: 0 kB
- AnonHugePages: 49152 kB
- ShmemHugePages: 0 kB
- ShmemPmdMapped: 0 kB
+ MemTotal: 32858820 kB
+ MemFree: 21001236 kB
+ MemAvailable: 27214312 kB
+ Buffers: 581092 kB
+ Cached: 5587612 kB
+ SwapCached: 0 kB
+ Active: 3237152 kB
+ Inactive: 7586256 kB
+ Active(anon): 94064 kB
+ Inactive(anon): 4570616 kB
+ Active(file): 3143088 kB
+ Inactive(file): 3015640 kB
+ Unevictable: 0 kB
+ Mlocked: 0 kB
+ SwapTotal: 0 kB
+ SwapFree: 0 kB
+ Zswap: 1904 kB
+ Zswapped: 7792 kB
+ Dirty: 12 kB
+ Writeback: 0 kB
+ AnonPages: 4654780 kB
+ Mapped: 266244 kB
+ Shmem: 9976 kB
+ KReclaimable: 517708 kB
+ Slab: 660044 kB
+ SReclaimable: 517708 kB
+ SUnreclaim: 142336 kB
+ KernelStack: 11168 kB
+ PageTables: 20540 kB
+ SecPageTables: 0 kB
+ NFS_Unstable: 0 kB
+ Bounce: 0 kB
+ WritebackTmp: 0 kB
+ CommitLimit: 16429408 kB
+ Committed_AS: 7715148 kB
+ VmallocTotal: 34359738367 kB
+ VmallocUsed: 40444 kB
+ VmallocChunk: 0 kB
+ Percpu: 29312 kB
+ EarlyMemtestBad: 0 kB
+ HardwareCorrupted: 0 kB
+ AnonHugePages: 4149248 kB
+ ShmemHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ FileHugePages: 0 kB
+ FilePmdMapped: 0 kB
+ CmaTotal: 0 kB
+ CmaFree: 0 kB
+ HugePages_Total: 0
+ HugePages_Free: 0
+ HugePages_Rsvd: 0
+ HugePages_Surp: 0
+ Hugepagesize: 2048 kB
+ Hugetlb: 0 kB
+ DirectMap4k: 401152 kB
+ DirectMap2M: 10008576 kB
+ DirectMap1G: 24117248 kB
MemTotal
- Total usable ram (i.e. physical ram minus a few reserved
+ Total usable RAM (i.e. physical RAM minus a few reserved
bits and the kernel binary code)
MemFree
- The sum of LowFree+HighFree
+ Total free RAM. On highmem systems, the sum of LowFree+HighFree
MemAvailable
An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
@@ -979,8 +1046,9 @@ Buffers
Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached
- in-memory cache for files read from the disk (the
- pagecache). Doesn't include SwapCached
+ In-memory cache for files read from the disk (the
+ pagecache) as well as tmpfs & shmem.
+ Doesn't include SwapCached.
SwapCached
Memory that once was swapped out, is swapped back in but
still also is in the swapfile (if memory is needed it
@@ -992,8 +1060,13 @@ Active
Inactive
Memory which has been less recently used. It is more
eligible to be reclaimed for other purposes
+Unevictable
+ Memory allocated for userspace which cannot be reclaimed, such
+ as mlocked pages, ramfs backing pages, secret memfd pages etc.
+Mlocked
+ Memory locked with mlock().
HighTotal, HighFree
- Highmem is all memory above ~860MB of physical memory
+ Highmem is all memory above ~860MB of physical memory.
Highmem areas are for use by userspace programs, or
for the pagecache. The kernel must use tricks to access
this memory, making it slower to access than lowmem.
@@ -1008,26 +1081,20 @@ SwapTotal
SwapFree
Memory which has been evicted from RAM, and is temporarily
on the disk
+Zswap
+ Memory consumed by the zswap backend (compressed size)
+Zswapped
+ Amount of anonymous memory stored in zswap (original size)
Dirty
Memory which is waiting to get written back to the disk
Writeback
Memory which is actively being written back to the disk
AnonPages
Non-file backed pages mapped into userspace page tables
-HardwareCorrupted
- The amount of RAM/memory in KB, the kernel identifies as
- corrupted.
-AnonHugePages
- Non-file backed huge pages mapped into userspace page tables
Mapped
- files which have been mmaped, such as libraries
+ files which have been mmapped, such as libraries
Shmem
Total memory used by shared memory (shmem) and tmpfs
-ShmemHugePages
- Memory used by shared memory (shmem) and tmpfs allocated
- with huge pages
-ShmemPmdMapped
- Shared memory mapped into userspace with huge pages
KReclaimable
Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
@@ -1038,12 +1105,16 @@ SReclaimable
Part of Slab, that might be reclaimed, such as caches
SUnreclaim
Part of Slab, that cannot be reclaimed on memory pressure
+KernelStack
+ Memory consumed by the kernel stacks of all tasks
PageTables
- amount of memory dedicated to the lowest level of page
- tables.
+ Memory consumed by userspace page tables
+SecPageTables
+ Memory consumed by secondary page tables; this currently
+ includes KVM mmu allocations on x86 and arm64.
NFS_Unstable
- NFS pages sent to the server, but not yet committed to stable
- storage
+ Always zero. Previously this counted pages which had been
+ written to the server but not yet committed to stable storage.
Bounce
Memory used for block device "bounce buffers"
WritebackTmp
@@ -1065,23 +1136,23 @@ CommitLimit
yield a CommitLimit of 7.3G.
For more details, see the memory overcommit documentation
- in vm/overcommit-accounting.
+ in mm/overcommit-accounting.
Committed_AS
The amount of memory presently allocated on the system.
The committed memory is a sum of all of the memory which
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
of memory, but only touches 300M of it will show up as
- using 1G. This 1G is memory which has been "committed" to
+ using 1G. This 1G is memory which has been "committed" to
by the VM and can be used at any time by the allocating
application. With strict overcommit enabled on the system
- (mode 2 in 'vm.overcommit_memory'),allocations which would
+ (mode 2 in 'vm.overcommit_memory'), allocations which would
exceed the CommitLimit (detailed above) will not be permitted.
This is useful if one needs to guarantee that processes will
not fail due to lack of memory once that memory has been
successfully allocated.
VmallocTotal
- total size of vmalloc memory area
+ total size of vmalloc virtual address space
VmallocUsed
amount of vmalloc area which is used
VmallocChunk
@@ -1089,6 +1160,37 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
+EarlyMemtestBad
+ The amount of RAM/memory in kB that was identified as corrupted
+ by early memtest. If memtest was not run, this field will not
+ be displayed at all. Size is never rounded down to 0 kB.
+ That means if 0 kB is reported, you can safely assume
+ there was at least one pass of memtest and none of the passes
+ found a single faulty byte of RAM.
+HardwareCorrupted
+ The amount of RAM/memory in kB that the kernel identifies as
+ corrupted.
+AnonHugePages
+ Non-file backed huge pages mapped into userspace page tables
+ShmemHugePages
+ Memory used by shared memory (shmem) and tmpfs allocated
+ with huge pages
+ShmemPmdMapped
+ Shared memory mapped into userspace with huge pages
+FileHugePages
+ Memory used for filesystem data (page cache) allocated
+ with huge pages
+FilePmdMapped
+ Page cache mapped into userspace with huge pages
+CmaTotal
+ Memory reserved for the Contiguous Memory Allocator (CMA)
+CmaFree
+ Free remaining memory in the CMA reserves
+HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
+ See Documentation/admin-guide/mm/hugetlbpage.rst.
+DirectMap4k, DirectMap2M, DirectMap1G
+ Breakdown of page table sizes used in the kernel's
+ identity mapping of RAM
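+
+As an example of combining these counters, MemTotal and MemAvailable yield a
+rough single number for memory in use (a sketch; both fields are in kB)::
+
+    > awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2} END {print t - a}' /proc/meminfo
+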
vmallocinfo
~~~~~~~~~~~
@@ -1096,7 +1198,7 @@ vmallocinfo
Provides information about vmalloced/vmaped areas. One line per area,
containing the virtual address range of the area, size in bytes,
caller information of the creator, and optional information depending
-on the kind of area :
+on the kind of area:
========== ===================================================
pages=nr number of pages
@@ -1141,101 +1243,23 @@ on the kind of area :
softirqs
~~~~~~~~
-Provides counts of softirq handlers serviced since boot time, for each cpu.
+Provides counts of softirq handlers serviced since boot time, for each CPU.
::
> cat /proc/softirqs
- CPU0 CPU1 CPU2 CPU3
+ CPU0 CPU1 CPU2 CPU3
HI: 0 0 0 0
- TIMER: 27166 27120 27097 27034
+ TIMER: 27166 27120 27097 27034
NET_TX: 0 0 0 17
NET_RX: 42 0 0 39
- BLOCK: 0 0 107 1121
- TASKLET: 0 0 0 290
- SCHED: 27035 26983 26971 26746
- HRTIMER: 0 0 0 0
- RCU: 1678 1769 2178 2250
-
-
-1.3 IDE devices in /proc/ide
-----------------------------
-
-The subdirectory /proc/ide contains information about all IDE devices of which
-the kernel is aware. There is one subdirectory for each IDE controller, the
-file drivers and a link for each IDE device, pointing to the device directory
-in the controller specific subtree.
-
-The file drivers contains general information about the drivers used for the
-IDE devices::
-
- > cat /proc/ide/drivers
- ide-cdrom version 4.53
- ide-disk version 1.08
-
-More detailed information can be found in the controller specific
-subdirectories. These are named ide0, ide1 and so on. Each of these
-directories contains the files shown in table 1-6.
-
-
-.. table:: Table 1-6: IDE controller info in /proc/ide/ide?
-
- ======= =======================================
- File Content
- ======= =======================================
- channel IDE channel (0 or 1)
- config Configuration (only for PCI/IDE bridge)
- mate Mate name
- model Type/Chipset of IDE controller
- ======= =======================================
-
-Each device connected to a controller has a separate subdirectory in the
-controllers directory. The files listed in table 1-7 are contained in these
-directories.
-
-
-.. table:: Table 1-7: IDE device information
-
- ================ ==========================================
- File Content
- ================ ==========================================
- cache The cache
- capacity Capacity of the medium (in 512Byte blocks)
- driver driver and version
- geometry physical and logical geometry
- identify device identify block
- media media type
- model device identifier
- settings device setup
- smart_thresholds IDE disk management thresholds
- smart_values IDE disk management values
- ================ ==========================================
-
-The most interesting file is ``settings``. This file contains a nice
-overview of the drive parameters::
-
- # cat /proc/ide/ide0/hda/settings
- name value min max mode
- ---- ----- --- --- ----
- bios_cyl 526 0 65535 rw
- bios_head 255 0 255 rw
- bios_sect 63 0 63 rw
- breada_readahead 4 0 127 rw
- bswap 0 0 1 r
- file_readahead 72 0 2097151 rw
- io_32bit 0 0 3 rw
- keepsettings 0 0 1 rw
- max_kb_per_request 122 1 127 rw
- multcount 0 0 8 rw
- nice1 1 0 1 rw
- nowerr 0 0 1 rw
- pio_mode write-only 0 255 w
- slow 0 0 1 rw
- unmaskirq 0 0 1 rw
- using_dma 0 0 1 rw
-
-
-1.4 Networking info in /proc/net
+ BLOCK: 0 0 107 1121
+ TASKLET: 0 0 0 290
+ SCHED: 27035 26983 26971 26746
+ HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250
+
+1.3 Networking info in /proc/net
--------------------------------
The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the
@@ -1281,6 +1305,7 @@ support this. Table 1-9 lists the files and their meaning.
rt_cache Routing cache
snmp SNMP data
sockstat Socket statistics
+ softnet_stat Per-CPU incoming packet queue statistics for online CPUs
tcp TCP sockets
udp UDP sockets
unix UNIX domain sockets
@@ -1314,12 +1339,12 @@ It will contain information that is specific to that bond, such as the
current slaves of the bond, the link status of the slaves, and how
many times the slaves link has failed.
-1.5 SCSI info
+1.4 SCSI info
-------------
-If you have a SCSI host adapter in your system, you'll find a subdirectory
-named after the driver for this adapter in /proc/scsi. You'll also see a list
-of all recognized SCSI devices in /proc/scsi::
+If you have a SCSI or ATA host adapter in your system, you'll find a
+subdirectory named after the driver for this adapter in /proc/scsi.
+You'll also see a list of all recognized SCSI devices in /proc/scsi::
>cat /proc/scsi/scsi
Attached devices:
@@ -1377,7 +1402,7 @@ AHA-2940 SCSI adapter::
Total transfers 0 (0 reads and 0 writes)
-1.6 Parallel port info in /proc/parport
+1.5 Parallel port info in /proc/parport
---------------------------------------
The directory /proc/parport contains information about the parallel ports of
@@ -1402,11 +1427,11 @@ These directories contain the four files shown in Table 1-10.
number or none).
========= ====================================================================
-1.7 TTY info in /proc/tty
+1.6 TTY info in /proc/tty
-------------------------
Information about the available and actually used tty's can be found in the
-directory /proc/tty.You'll find entries for drivers and line disciplines in
+directory /proc/tty. You'll find entries for drivers and line disciplines in
this directory, as shown in Table 1-11.
@@ -1437,7 +1462,7 @@ To see which tty's are currently in use, you can simply look into the file
unknown /dev/tty 4 1-63 console
-1.8 Miscellaneous kernel statistics in /proc/stat
+1.7 Miscellaneous kernel statistics in /proc/stat
-------------------------------------------------
Various pieces of information about kernel activity are available in the
@@ -1445,16 +1470,18 @@ Various pieces of information about kernel activity are available in the
since the system first booted. For a quick look, simply cat the file::
> cat /proc/stat
- cpu 2255 34 2290 22625563 6290 127 456 0 0 0
- cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
- cpu1 1123 0 849 11313845 2614 0 18 0 0 0
- intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
- ctxt 1990473
- btime 1062191376
- processes 2915
- procs_running 1
+ cpu 237902850 368826709 106375398 1873517540 1135548 0 14507935 0 0 0
+ cpu0 60045249 91891769 26331539 468411416 495718 0 5739640 0 0 0
+ cpu1 59746288 91759249 26609887 468860630 312281 0 4384817 0 0 0
+ cpu2 59489247 92985423 26904446 467808813 171668 0 2268998 0 0 0
+ cpu3 58622065 92190267 26529524 468436680 155879 0 2114478 0 0 0
+ intr 8688370575 8 3373 0 0 0 0 0 0 1 40791 0 0 353317 0 0 0 0 224789828 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 190974333 41958554 123983334 43 0 224593 0 0 0 <more 0's deleted>
+ ctxt 22848221062
+ btime 1605316999
+ processes 746787147
+ procs_running 2
procs_blocked 0
- softirq 183433 0 21755 12 39 1137 231 21459 2263
+ softirq 12121874454 100099120 3938138295 127375644 2795979 187870761 0 173808342 3072582055 52608 224184354
The very first "cpu" line aggregates the numbers in all of the other "cpuN"
lines. These numbers identify the amount of time the CPU has spent performing
@@ -1468,9 +1495,9 @@ second). The meanings of the columns are as follows, from left to right:
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
are several problems:
- 1. Cpu will not wait for I/O to complete, iowait is the time that a task is
- waiting for I/O to complete. When cpu goes into idle state for
- outstanding task io, another task will be scheduled on this CPU.
+ 1. The CPU will not wait for I/O to complete; iowait is the time that a task
+ is waiting for I/O to complete. When a CPU goes into the idle state for
+ outstanding task I/O, another task will be scheduled on this CPU.
2. In a multi-core CPU, the task waiting for I/O to complete is not running
on any CPU, so the iowait of each CPU is difficult to calculate.
3. The value of iowait field in /proc/stat will decrease in certain
@@ -1510,14 +1537,14 @@ softirqs serviced; each subsequent column is the total for that particular
softirq.
-1.9 Ext4 file system parameters
+1.8 Ext4 file system parameters
-------------------------------
Information about mounted ext4 file systems can be found in
/proc/fs/ext4. Each mounted filesystem will have a directory in
/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
-/proc/fs/ext4/dm-0). The files in each per-device directory are shown
-in Table 1-12, below.
+/proc/fs/ext4/sda9 or /proc/fs/ext4/dm-0). The files in each per-device
+directory are shown in Table 1-12, below.
.. table:: Table 1-12: Files in /proc/fs/ext4/<devname>
@@ -1526,8 +1553,8 @@ in Table 1-12, below.
mb_groups details of multiblock allocator buddy cache of free blocks
============== ==========================================================
-2.0 /proc/consoles
-------------------
+1.9 /proc/consoles
+-------------------
Shows registered system console lines.
To see which character device lines are currently used for the system console
@@ -1587,10 +1614,9 @@ production system. Set up a development machine and test to make sure that
everything works the way you want it to. You may have no alternative but to
reboot the machine once an error has been made.
-To change a value, simply echo the new value into the file. An example is
-given below in the section on the file system data. You need to be root to do
-this. You can create your own boot script to perform this every time your
-system boots.
+To change a value, simply echo the new value into the file.
+You need to be root to do this. You can create your own boot script
+to perform this every time your system boots.
The files in /proc/sys can be used to fine tune and monitor miscellaneous and
general things in the operation of the Linux kernel. Since some of the files
@@ -1598,12 +1624,12 @@ can inadvertently disrupt your system, it is advisable to read both
documentation and source before actually making adjustments. In any case, be
very careful when writing to any of these files. The entries in /proc may
change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
-review the kernel documentation in the directory /usr/src/linux/Documentation.
+review the kernel documentation in the directory linux/Documentation.
This chapter is heavily based on the documentation included in the pre 2.2
kernels, and became part of it in version 2.2.1 of the Linux kernel.
-Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these
-entries.
+Please see: Documentation/admin-guide/sysctl/ directory for descriptions of
+these entries.
Summary
-------
@@ -1621,8 +1647,8 @@ Chapter 3: Per-process Parameters
3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
--------------------------------------------------------------------------------
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory (oom) conditions.
The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted. The
@@ -1631,9 +1657,6 @@ may allocate from based on an estimation of its current memory and swap use.
For example, if a task is using all allowed memory, its badness score will be
1000. If it is using half of its allowed memory, its score will be 500.
-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
The amount of "allowed" memory depends on the context in which the oom killer
was called. If it is due to the memory assigned to the allocating task's cpuset
being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1669,24 +1692,22 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
requires CAP_SYS_RESOURCE.
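
For example, a process can volunteer itself as a preferred victim, and the
effect shows up in its score right away (a sketch; the pid and the resulting
score are illustrative)::

    > echo 500 > /proc/1234/oom_score_adj
    > cat /proc/1234/oom_score
    612
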
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
-This file can be used to check the current score used by the oom-killer is for
+This file can be used to check the current score used by the oom-killer for
any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
+Please note that the exported value includes oom_score_adj so it is
+effectively in range [0,2000].
+
3.3 /proc/<pid>/io - Display the IO accounting fields
-------------------------------------------------------
-This file contains IO statistics for each running process
+This file contains IO statistics for each running process.
Example
~~~~~~~
@@ -1717,7 +1738,7 @@ The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
-pagecache)
+pagecache).
wchar
@@ -1839,19 +1860,19 @@ For example::
This file contains lines of the form::
36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
- (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
-
- (1) mount ID: unique identifier of the mount (may be reused after umount)
- (2) parent ID: ID of parent (or of self for the top of the mount tree)
- (3) major:minor: value of st_dev for files on filesystem
- (4) root: root of the mount within the filesystem
- (5) mount point: mount point relative to the process's root
- (6) mount options: per mount options
- (7) optional fields: zero or more fields of the form "tag[:value]"
- (8) separator: marks the end of the optional fields
- (9) filesystem type: name of filesystem of the form "type[.subtype]"
- (10) mount source: filesystem specific information or "none"
- (11) super options: per super block options
+ (1)(2)(3) (4) (5) (6) (n…m) (m+1)(m+2) (m+3) (m+4)
+
+ (1) mount ID: unique identifier of the mount (may be reused after umount)
+ (2) parent ID: ID of parent (or of self for the top of the mount tree)
+ (3) major:minor: value of st_dev for files on filesystem
+ (4) root: root of the mount within the filesystem
+ (5) mount point: mount point relative to the process's root
+ (6) mount options: per mount options
+ (n…m) optional fields: zero or more fields of the form "tag[:value]"
+ (m+1) separator: marks the end of the optional fields
+ (m+2) filesystem type: name of filesystem of the form "type[.subtype]"
+ (m+3) mount source: filesystem specific information or "none"
+ (m+4) super options: per super block options
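+
+Because the number of optional fields varies from line to line, parsers
+should locate the "-" separator first, as in this sketch::
+
+    > awk '{for (i = 7; i <= NF; i++) if ($i == "-") break; print $5, $(i + 1)}' /proc/self/mountinfo
+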
Parsers should ignore all unrecognised optional fields. Currently the
possible optional fields are:
@@ -1870,12 +1891,12 @@ unbindable mount is unbindable
For more information on mount propagation see:
- Documentation/filesystems/sharedsubtree.txt
+ Documentation/filesystems/sharedsubtree.rst
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
--------------------------------------------------------
-These files provide a method to access a tasks comm value. It also allows for
+These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings' comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
than the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
@@ -1888,32 +1909,34 @@ This file provides a fast way to retrieve first level children pids
of a task pointed to by the <pid>/<tid> pair. The format is a space-separated
stream of pids.
-Note the "first level" here -- if a child has own children they will
-not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+Note the "first level" here -- if a child has its own children they will
+not be listed here; one needs to read /proc/<children-pid>/task/<tid>/children
to obtain the descendants.
Since this interface is intended to be fast and cheap it doesn't
guarantee to provide precise results and some children might be
skipped, especially if they've exited right after we printed their
-pids, so one need to either stop or freeze processes being inspected
+pids, so one needs to either stop or freeze processes being inspected
if precise results are needed.
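
For example (a sketch; the pids shown are illustrative)::

    > cat /proc/1/task/1/children
    137 181 204 230
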
3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
---------------------------------------------------------------
This file provides information associated with an opened file. The regular
-files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos'
-represents the current offset of the opened file in decimal form [see lseek(2)
-for details], 'flags' denotes the octal O_xxx mask the file has been
-created with [see open(2) for details] and 'mnt_id' represents mount ID of
-the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
-for details].
+files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
+The 'pos' represents the current offset of the opened file in decimal
+form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the
+file has been created with [see open(2) for details] and 'mnt_id' represents
+mount ID of the file system containing the opened file [see 3.5
+/proc/<pid>/mountinfo for details]. 'ino' represents the inode number of
+the file.
A typical output is::
pos: 0
flags: 0100002
mnt_id: 19
+ ino: 63107
All locks associated with a file descriptor are shown in its fdinfo too::
@@ -1930,6 +1953,7 @@ Eventfd files
pos: 0
flags: 04002
mnt_id: 9
+ ino: 63107
eventfd-count: 5a
where 'eventfd-count' is hex value of a counter.
@@ -1942,6 +1966,7 @@ Signalfd files
pos: 0
flags: 04002
mnt_id: 9
+ ino: 63107
sigmask: 0000000000000200
where 'sigmask' is hex value of the signal mask associated
@@ -1955,6 +1980,7 @@ Epoll files
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
where 'tfd' is a target file descriptor number in decimal form,
@@ -1971,9 +1997,11 @@ For inotify files the format is the following::
pos: 0
flags: 02000000
+ mnt_id: 9
+ ino: 63107
inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
-where 'wd' is a watch descriptor in decimal form, ie a target file
+where 'wd' is a watch descriptor in decimal form, i.e. a target file
descriptor number, 'ino' and 'sdev' are inode and device where the
target file resides and the 'mask' is the mask of events, all in hex
form [see inotify(7) for more details].
@@ -1993,6 +2021,7 @@ For fanotify files the format is::
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
fanotify flags:10 event-flags:0
fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2000,10 +2029,10 @@ For fanotify files the format is::
where fanotify 'flags' and 'event-flags' are values used in fanotify_init
call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
flags associated with mark which are tracked separately from events
-mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
+mask. 'ino' and 'sdev' are target inode and device, 'mask' is the events
mask and 'ignored_mask' is the mask of events which are to be ignored.
-All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
-does provide information about flags and mask used in fanotify_mark
+All are in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
+provides information about the flags and mask used in the fanotify_mark
call [see fsnotify manpage for details].
While the first three lines are mandatory and always printed, the rest is
@@ -2017,6 +2046,7 @@ Timerfd files
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
clockid: 0
ticks: 0
settime flags: 01
@@ -2026,11 +2056,27 @@ Timerfd files
where 'clockid' is the clock type and 'ticks' is the number of timer expirations
that have occurred [see timerfd_create(2) for details]. 'settime flags' are the
flags, in octal form, that were used to set up the timer [see timerfd_settime(2) for
-details]. 'it_value' is remaining time until the timer exiration.
+details]. 'it_value' is remaining time until the timer expiration.
'it_interval' is the interval for the timer. Note the timer might be set up
with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value'
still exhibits timer's remaining time.
+DMA Buffer files
+~~~~~~~~~~~~~~~~
+
+::
+
+ pos: 0
+ flags: 04002
+ mnt_id: 9
+ ino: 63107
+ size: 32768
+ count: 2
+ exp_name: system-heap
+
+where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of
+the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter.
+
3.9 /proc/<pid>/map_files - Information about memory mapped files
---------------------------------------------------------------------
This directory contains symbolic links which represent memory mapped files
@@ -2056,13 +2102,13 @@ are actually shared.
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
---------------------------------------------------------
This file provides the value of the task's timerslack value in nanoseconds.
-This value specifies a amount of time that normal timers may be deferred
+This value specifies an amount of time that normal timers may be deferred
in order to coalesce timers and avoid unnecessary wakeups.
-This allows a task's interactivity vs power consumption trade off to be
+This allows a task's interactivity vs power consumption tradeoff to be
adjusted.
-Writing 0 to the file will set the tasks timerslack to the default value.
+Writing 0 to the file will set the task's timerslack to the default value.
Valid values are from 0 to ULLONG_MAX.
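
For example (a sketch; the pid is illustrative, and 50000 ns is the usual
default slack)::

    > cat /proc/1234/timerslack_ns
    50000
    > echo 0 > /proc/1234/timerslack_ns
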
@@ -2102,10 +2148,10 @@ Example
Description
~~~~~~~~~~~
-x86 specific entries:
+x86 specific entries
~~~~~~~~~~~~~~~~~~~~~
-AVX512_elapsed_ms:
+AVX512_elapsed_ms
^^^^^^^^^^^^^^^^^^
If AVX512 is supported on the machine, this entry shows the milliseconds
@@ -2131,8 +2177,24 @@ AVX512_elapsed_ms:
the task is unlikely to be an AVX512 user; but depending on the workload and the
scheduling scenario, it could also be a false negative, as mentioned above.
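
For example, candidate AVX512 users across the system can be listed with a
simple scan (a sketch; requires a kernel that exposes arch_status)::

    > grep -s AVX512_elapsed_ms /proc/[0-9]*/arch_status
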
-Configuring procfs
-------------------
+3.13 /proc/<pid>/fd - List of symlinks to open files
+-------------------------------------------------------
+This directory contains symbolic links which represent open files
+the process is maintaining. Example output::
+
+ lr-x------ 1 root root 64 Sep 20 17:53 0 -> /dev/null
+ l-wx------ 1 root root 64 Sep 20 17:53 1 -> /dev/null
+ lrwx------ 1 root root 64 Sep 20 17:53 10 -> 'socket:[12539]'
+ lrwx------ 1 root root 64 Sep 20 17:53 11 -> 'socket:[12540]'
+ lrwx------ 1 root root 64 Sep 20 17:53 12 -> 'socket:[12542]'
+
+The number of open files for the process is stored in the 'size' member
+of the stat() output for /proc/<pid>/fd for fast access.
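+
+For example, the count can be read without listing the directory (a sketch;
+the output is illustrative)::
+
+    > stat -c %s /proc/$$/fd
+    4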
+
+
+Chapter 4: Configuring procfs
+=============================
4.1 Mount options
---------------------
@@ -2142,28 +2204,78 @@ The following mount options are supported:
========= ========================================================
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
+ subset= Show only the specified subset of procfs.
========= ========================================================
-hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
-(default).
-
-hidepid=1 means users may not access any /proc/<pid>/ directories but their
-own. Sensitive files like cmdline, sched*, status are now protected against
-other users. This makes it impossible to learn whether any user runs
-specific program (given the program doesn't reveal itself by its behaviour).
-As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
-poorly written programs passing sensitive information via program arguments are
-now protected against local eavesdroppers.
-
-hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
-users. It doesn't mean that it hides a fact whether a process with a specific
-pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
-but it hides process' uid and gid, which may be learned by stat()'ing
-/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering
-information about running processes, whether some daemon runs with elevated
-privileges, whether other user runs some sensitive program, whether other users
-run any program at all, etc.
+hidepid=off or hidepid=0 means classic mode - everybody may access all
+/proc/<pid>/ directories (default).
+
+hidepid=noaccess or hidepid=1 means users may not access any /proc/<pid>/
+directories but their own. Sensitive files like cmdline, sched*, status are now
+protected against other users. This makes it impossible to learn whether any
+user runs a specific program (given the program doesn't reveal itself by its
+behaviour). As an additional bonus, as /proc/<pid>/cmdline is inaccessible to
+other users, poorly written programs passing sensitive information via program
+arguments are now protected against local eavesdroppers.
+
+hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
+fully invisible to other users. It doesn't hide the fact that a process with a
+specific pid value exists (that can be learned by other means, e.g. by
+"kill -0 $PID"), but it hides the process's uid and gid, which may otherwise be
+learned by stat()'ing /proc/<pid>/. It greatly complicates an intruder's task of
+gathering information about running processes: whether some daemon runs with
+elevated privileges, whether another user runs some sensitive program, whether
+other users run any program at all, etc.
+
+hidepid=ptraceable or hidepid=4 means that procfs should only contain
+/proc/<pid>/ directories that the caller can ptrace.
gid= defines a group authorized to learn processes information otherwise
prohibited by hidepid=. If you use some daemon like identd which needs to learn
information about processes, just add identd to this group.
+
+subset=pid hides all top level files and directories in the procfs that
+are not related to tasks.
+
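+For example, a sandbox supervisor might combine both options (a sketch; the
+mount point is illustrative)::
+
+    # mount -t proc -o hidepid=invisible,subset=pid proc /tmp/sandbox-proc
+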
+Chapter 5: Filesystem behavior
+==============================
+
+Originally, before the advent of pid namespaces, procfs was a global file
+system. This meant that there was only one procfs instance in the system.
+
+When pid namespaces were added, a separate procfs instance was mounted in
+each pid namespace. So, procfs mount options are global among all
+mountpoints within the same namespace::
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
+
+ # strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+ mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+ +++ exited with 0 +++
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+
+and only after remounting procfs will the mount options change at all
+mountpoints::
+
+ # mount -o remount,hidepid=1 -t proc proc /tmp/proc
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=1 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+
+This behavior is different from the behavior of other filesystems.
+
+The new procfs behavior is more like that of other filesystems. Each procfs
+mount creates a new procfs instance. Mount options affect only their own
+procfs instance. This means that it is now possible to have several procfs
+instances displaying tasks with different filtering options in one pid
+namespace::
+
+ # mount -o hidepid=invisible -t proc proc /proc
+ # mount -o hidepid=noaccess -t proc proc /tmp/proc
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=invisible 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst
index fd13433d362c..560f3d470422 100644
--- a/Documentation/filesystems/qnx6.rst
+++ b/Documentation/filesystems/qnx6.rst
@@ -135,7 +135,7 @@ inode.
Character and block special devices do not exist in QNX as those files
are handled by the QNX kernel/drivers and created in /dev independent of the
-underlaying filesystem.
+underlying filesystem.
Long filenames
--------------
@@ -176,7 +176,7 @@ Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
-Each superblock got it's own half of the system area. So superblock #1
+Each superblock gets its own half of the system area. So superblock #1
always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.rst
index 32874b06ebe9..abd4303c546e 100644
--- a/Documentation/filesystems/quota.txt
+++ b/Documentation/filesystems/quota.rst
@@ -1,4 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+===============
Quota subsystem
===============
@@ -16,7 +18,7 @@ Quota limits (and amount of grace time) are set independently for each
filesystem.
For more details about quota design, see the documentation in quota-tools package
-(http://sourceforge.net/projects/linuxquota).
+(https://sourceforge.net/projects/linuxquota).
Quota netlink interface
=======================
@@ -29,16 +31,17 @@ the above events to userspace. There they can be captured by an application
and processed accordingly.
The interface uses generic netlink framework (see
-http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more
-details about this layer). The name of the quota generic netlink interface
-is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>.
-Since the quota netlink protocol is not namespace aware, quota netlink messages
-are sent only in initial network namespace.
+https://lwn.net/Articles/208755/ and http://www.infradead.org/~tgr/libnl/ for
+more details about this layer). The name of the quota generic netlink interface
+is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. Since
+the quota netlink protocol is not namespace aware, quota netlink messages are
+sent only in initial network namespace.
Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
This command is used to send a notification about any of the above mentioned
events. Each message has six attributes. These are (type of the argument is
in parentheses):
+
QUOTA_NL_A_QTYPE (u32)
- type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
QUOTA_NL_A_EXCESS_ID (u64)
@@ -48,20 +51,34 @@ in parentheses):
- UID of a user who caused the event
QUOTA_NL_A_WARNING (u32)
- what kind of limit is exceeded:
- QUOTA_NL_IHARDWARN - inode hardlimit
- QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
- than given grace period
- QUOTA_NL_ISOFTWARN - inode softlimit
- QUOTA_NL_BHARDWARN - space (block) hardlimit
- QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
- longer than given grace period.
- QUOTA_NL_BSOFTWARN - space (block) softlimit
+
+ QUOTA_NL_IHARDWARN
+ inode hardlimit
+ QUOTA_NL_ISOFTLONGWARN
+ inode softlimit is exceeded longer
+ than given grace period
+ QUOTA_NL_ISOFTWARN
+ inode softlimit
+ QUOTA_NL_BHARDWARN
+ space (block) hardlimit
+ QUOTA_NL_BSOFTLONGWARN
+ space (block) softlimit is exceeded
+ longer than given grace period.
+ QUOTA_NL_BSOFTWARN
+ space (block) softlimit
+
- four warnings are also defined for the event when user stops
exceeding some limit:
- QUOTA_NL_IHARDBELOW - inode hardlimit
- QUOTA_NL_ISOFTBELOW - inode softlimit
- QUOTA_NL_BHARDBELOW - space (block) hardlimit
- QUOTA_NL_BSOFTBELOW - space (block) softlimit
+
+ QUOTA_NL_IHARDBELOW
+ inode hardlimit
+ QUOTA_NL_ISOFTBELOW
+ inode softlimit
+ QUOTA_NL_BHARDBELOW
+ space (block) hardlimit
+ QUOTA_NL_BSOFTBELOW
+ space (block) softlimit
+
QUOTA_NL_A_DEV_MAJOR (u32)
- major number of a device with the affected filesystem
QUOTA_NL_A_DEV_MINOR (u32)
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 6c576e241d86..447f767c6462 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -6,8 +6,7 @@ Ramfs, rootfs and initramfs
October 17, 2005
-Rob Landley <rob@landley.net>
-=============================
+:Author: Rob Landley <rob@landley.net>
What is ramfs?
--------------
@@ -71,7 +70,7 @@ be allowed write access to a ramfs mount.
A ramfs derivative called tmpfs was created to add size limits, and the ability
to write the data to swap space. Normal users can be allowed write access to
-tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
+tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information.
What is rootfs?
---------------
@@ -170,7 +169,7 @@ Documentation/driver-api/early-userspace/early_userspace_support.rst for more de
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
-usr/gen_initramfs_list.sh), and proceeds to package up that directory
+usr/gen_initramfs.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
@@ -246,15 +245,15 @@ If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
-- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
-- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+- https://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+- https://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
- http://www.linuxfromscratch.org/lfs/view/stable/
-The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+The "klibc" package (https://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
-I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+I use uClibc (https://www.uclibc.org) and busybox (https://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.3 release.)
diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.rst
index d412b236a9d6..1e1713d00010 100644
--- a/Documentation/filesystems/seq_file.txt
+++ b/Documentation/filesystems/seq_file.rst
@@ -1,8 +1,13 @@
-The seq_file interface
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+The seq_file Interface
+======================
Copyright 2003 Jonathan Corbet <corbet@lwn.net>
+
This file is originally from the LWN.net Driver Porting series at
- http://lwn.net/Articles/driver-porting/
+ https://lwn.net/Articles/driver-porting/
There are numerous ways for a device driver (or other kernel component) to
@@ -43,7 +48,7 @@ loadable module which creates a file called /proc/sequence. The file, when
read, simply produces a set of increasing integer values, one per line. The
sequence will continue until the user loses patience and finds something
better to do. The file is seekable, in that one can do something like the
-following:
+following::
dd if=/proc/sequence of=out1 count=1
dd if=/proc/sequence skip=1 of=out2 count=1
@@ -52,19 +57,21 @@ Then concatenate the output files out1 and out2 and get the right
result. Yes, it is a thoroughly useless module, but the point is to show
how the mechanism works without getting lost in other details. (Those
wanting to see the full source for this module can find it at
-http://lwn.net/Articles/22359/).
+https://lwn.net/Articles/22359/).
Deprecated create_proc_entry
+============================
Note that the above article uses create_proc_entry which was removed in
-kernel 3.10. Current versions require the following update
+kernel 3.10. Current versions require the following update::
-- entry = create_proc_entry("sequence", 0, NULL);
-- if (entry)
-- entry->proc_fops = &ct_file_ops;
-+ entry = proc_create("sequence", 0, NULL, &ct_file_ops);
+ - entry = create_proc_entry("sequence", 0, NULL);
+ - if (entry)
+ - entry->proc_fops = &ct_file_ops;
+ + entry = proc_create("sequence", 0, NULL, &ct_file_ops);
The iterator interface
+======================
Modules implementing a virtual file with seq_file must implement an
iterator object that allows stepping through the data of interest
@@ -99,7 +106,7 @@ position. The pos passed to start() will always be either zero, or
the most recent pos used in the previous session.
For our simple sequence example,
-the start() function looks like:
+the start() function looks like::
static void *ct_seq_start(struct seq_file *s, loff_t *pos)
{
@@ -122,14 +129,16 @@ also a special value which can be returned by the start() function
called SEQ_START_TOKEN; it can be used if you wish to instruct your
show() function (described below) to print a header at the top of the
output. SEQ_START_TOKEN should only be used if the offset is zero,
-however.
+however. SEQ_START_TOKEN has no special meaning to the core seq_file
+code. It is provided as a convenience for a start() function to
+communicate with the next() and show() functions.
The next function to implement is called, amazingly, next(); its job is to
move the iterator forward to the next position in the sequence. The
example module can simply increment the position by one; more useful
modules will do what is needed to step through some data structure. The
next() function returns a new iterator, or NULL if the sequence is
-complete. Here's the example version:
+complete. Here's the example version::
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
@@ -138,13 +147,29 @@ complete. Here's the example version:
return spos;
}
+The next() function should set ``*pos`` to a value that start() can use
+to find the new location in the sequence. When the iterator is being
+stored in the private data area, rather than being reinitialized on each
+start(), it might seem sufficient to simply set ``*pos`` to any non-zero
+value (zero always tells start() to restart the sequence). This is not
+sufficient due to historical problems.
+
+Historically, many next() functions have *not* updated ``*pos`` at
+end-of-file. If the value is then used by start() to initialise the
+iterator, this can result in corner cases where the last entry in the
+sequence is reported twice in the file. In order to discourage this bug
+from being resurrected, the core seq_file code now produces a warning if
+a next() function does not change the value of ``*pos``. Consequently a
+next() function *must* change the value of ``*pos``, and of course must
+set it to a non-zero value.
+
The stop() function closes a session; its job, of course, is to clean
up. If dynamic memory is allocated for the iterator, stop() is the
place to free it; if a lock was taken by start(), stop() must release
-that lock. The value that *pos was set to by the last next() call
+that lock. The value that ``*pos`` was set to by the last next() call
before stop() is remembered, and used for the first start() call of
the next session unless lseek() has been called on the file; in that
-case next start() will be asked to start at position zero.
+case next start() will be asked to start at position zero::
static void ct_seq_stop(struct seq_file *s, void *v)
{
@@ -152,7 +177,7 @@ case next start() will be asked to start at position zero.
}
Finally, the show() function should format the object currently pointed to
-by the iterator for output. The example module's show() function is:
+by the iterator for output. The example module's show() function is::
static int ct_seq_show(struct seq_file *s, void *v)
{
@@ -169,7 +194,7 @@ generated output before returning SEQ_SKIP, that output will be dropped.
We will look at seq_printf() in a moment. But first, the definition of the
seq_file iterator is finished by creating a seq_operations structure with
-the four functions we have just defined:
+the four functions we have just defined::
static const struct seq_operations ct_seq_ops = {
.start = ct_seq_start,
@@ -192,8 +217,15 @@ between the calls to start() and stop(), so holding a lock during that time
is a reasonable thing to do. The seq_file code will also avoid taking any
other locks while the iterator is active.
+The iterator value returned by start() or next() is guaranteed to be
+passed to a subsequent next() or stop() call. This allows resources
+such as locks that were taken to be reliably released. There is *no*
+guarantee that the iterator will be passed to show(), though in practice
+it often will be.
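+
+As a sketch of that pairing (ct_lock and ct_find() are hypothetical)::
+
+	static void *ct_seq_start(struct seq_file *s, loff_t *pos)
+	{
+		mutex_lock(&ct_lock);	/* released in ct_seq_stop() */
+		return ct_find(*pos);
+	}
+
+	static void ct_seq_stop(struct seq_file *s, void *v)
+	{
+		mutex_unlock(&ct_lock);
+	}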
+
Formatted output
+================
The seq_file code manages positioning within the output created by the
iterator and getting it into the user's buffer. But, for that to work, that
@@ -203,7 +235,7 @@ been defined which make this task easy.
Most code will simply use seq_printf(), which works pretty much like
printk(), but which requires the seq_file pointer as an argument.
-For straight character output, the following functions may be used:
+For straight character output, the following functions may be used::
seq_putc(struct seq_file *m, char c);
seq_puts(struct seq_file *m, const char *s);
@@ -213,7 +245,7 @@ The first two output a single character and a string, just like one would
expect. seq_escape() is like seq_puts(), except that any character in s
which is in the string esc will be represented in octal form in the output.
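+
+For instance, a show() method printing a user-supplied string might use (a
+sketch; ``name`` is illustrative)::
+
+	seq_escape(m, name, " \t\n\\");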
-There are also a pair of functions for printing filenames:
+There are also a pair of functions for printing filenames::
int seq_path(struct seq_file *m, const struct path *path,
const char *esc);
@@ -226,8 +258,10 @@ the path relative to the current process's filesystem root. If a different
root is desired, it can be passed to seq_path_root(). If it turns out that
path cannot be reached from root, seq_path_root() returns SEQ_SKIP.
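+
+In a show() method this might look like (a sketch, assuming a struct file
+pointer is at hand)::
+
+	seq_path(m, &file->f_path, " \t\n\\");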
-A function producing complicated output may want to check
+A function producing complicated output may want to check::
+
bool seq_has_overflowed(struct seq_file *m);
+
and avoid further seq_<output> calls if true is returned.
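+
+For example, a show() method might bail out early (a sketch;
+print_extra_details() is a hypothetical helper)::
+
+	static int ct_seq_show(struct seq_file *s, void *v)
+	{
+		seq_printf(s, "%Ld\n", *(loff_t *)v);
+		if (seq_has_overflowed(s))
+			return 0;	/* buffer full; the core will retry */
+		print_extra_details(s, v);
+		return 0;
+	}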
A true return from seq_has_overflowed means that the seq_file buffer will
@@ -236,6 +270,7 @@ buffer and retry printing.
Making it all work
+==================
So far, we have a nice set of functions which can produce output within the
seq_file system, but we have not yet turned them into a file that a user
@@ -244,7 +279,7 @@ creation of a set of file_operations which implement the operations on that
file. The seq_file interface provides a set of canned operations which do
most of the work. The virtual file author still must implement the open()
method, however, to hook everything up. The open function is often a single
-line, as in the example module:
+line, as in the example module::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -263,7 +298,7 @@ by the iterator functions.
There is also a wrapper function to seq_open() called seq_open_private(). It
kmallocs a zero filled block of memory and stores a pointer to it in the
private field of the seq_file structure, returning 0 on success. The
-block size is specified in a third parameter to the function, e.g.:
+block size is specified in a third parameter to the function, e.g.::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -273,7 +308,7 @@ block size is specified in a third parameter to the function, e.g.:
There is also a variant function, __seq_open_private(), which is functionally
identical except that, if successful, it returns the pointer to the allocated
-memory block, allowing further initialisation e.g.:
+memory block, allowing further initialisation e.g.::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -295,7 +330,7 @@ frees the memory allocated in the corresponding open.
The other operations of interest - read(), llseek(), and release() - are
all implemented by the seq_file code itself. So a virtual file's
-file_operations structure will look like:
+file_operations structure will look like::
static const struct file_operations ct_file_ops = {
.owner = THIS_MODULE,
@@ -309,7 +344,7 @@ There is also a seq_release_private() which passes the contents of the
seq_file private field to kfree() before releasing the structure.
The final step is the creation of the /proc file itself. In the example
-code, that is done in the initialization code in the usual way:
+code, that is done in the initialization code in the usual way::
static int ct_init(void)
{
@@ -325,9 +360,10 @@ And that is pretty much it.
seq_list
+========
If your file will be iterating through a linked list, you may find these
-routines useful:
+routines useful::
struct list_head *seq_list_start(struct list_head *head,
loff_t pos);
@@ -338,15 +374,16 @@ routines useful:
These helpers will interpret pos as a position within the list and iterate
accordingly. Your start() and next() functions need only invoke the
-seq_list_* helpers with a pointer to the appropriate list_head structure.
+``seq_list_*`` helpers with a pointer to the appropriate list_head structure.
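+
+As a sketch, the start() and next() methods for a list-backed iterator
+might be (assuming a hypothetical ct_list protected by ct_lock)::
+
+	static void *ct_seq_start(struct seq_file *s, loff_t *pos)
+	{
+		mutex_lock(&ct_lock);
+		return seq_list_start(&ct_list, *pos);
+	}
+
+	static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
+	{
+		return seq_list_next(v, &ct_list, pos);
+	}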
The extra-simple version
+========================
For extremely simple virtual files, there is an even easier interface. A
module can define only the show() function, which should create all the
output that the virtual file will contain. The file's open() method then
-calls:
+calls::
int single_open(struct file *file,
int (*show)(struct seq_file *m, void *p),
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.rst
index 8ccfbd55244b..1cf56489ed48 100644
--- a/Documentation/filesystems/sharedsubtree.txt
+++ b/Documentation/filesystems/sharedsubtree.rst
@@ -1,7 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
Shared Subtrees
----------------
+===============
-Contents:
+.. Contents:
1) Overview
2) Features
3) Setting mount states
@@ -41,31 +44,38 @@ replicas continue to be exactly same.
Here is an example:
- Let's say /mnt has a mount that is shared.
- mount --make-shared /mnt
+ Let's say /mnt has a mount that is shared::
+
+ mount --make-shared /mnt
Note: mount(8) command now supports the --make-shared flag,
so the sample 'smount' program is no longer needed and has been
removed.
- # mount --bind /mnt /tmp
+ ::
+
+ # mount --bind /mnt /tmp
+
The above command replicates the mount at /mnt to the mountpoint /tmp
	and the contents of both mounts remain identical.
- #ls /mnt
- a b c
+ ::
- #ls /tmp
- a b c
+ #ls /mnt
+ a b c
- Now let's say we mount a device at /tmp/a
- # mount /dev/sd0 /tmp/a
+ #ls /tmp
+ a b c
- #ls /tmp/a
- t1 t2 t3
+ Now let's say we mount a device at /tmp/a::
- #ls /mnt/a
- t1 t2 t3
+ # mount /dev/sd0 /tmp/a
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ #ls /mnt/a
+ t1 t2 t3
Note that the mount has propagated to the mount at /mnt as well.
@@ -123,27 +133,29 @@ replicas continue to be exactly same.
2d) An unbindable mount is an unbindable private mount
- let's say we have a mount at /mnt and we make it unbindable
+ let's say we have a mount at /mnt and we make it unbindable::
+
+ # mount --make-unbindable /mnt
- # mount --make-unbindable /mnt
+ Let's try to bind mount this mount somewhere else::
- Let's try to bind mount this mount somewhere else.
- # mount --bind /mnt /tmp
- mount: wrong fs type, bad option, bad superblock on /mnt,
- or too many mounted file systems
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
	Binding an unbindable mount is an invalid operation.
3) Setting mount states
+-----------------------
The mount command (util-linux package) can be used to set mount
- states:
+ states::
- mount --make-shared mountpoint
- mount --make-slave mountpoint
- mount --make-private mountpoint
- mount --make-unbindable mountpoint
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
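+
+	The same transitions can be requested programmatically with the
+	mount(2) system call; a minimal C sketch (error handling omitted)::
+
+		#include <sys/mount.h>
+
+		/* equivalent of "mount --make-shared mountpoint"; the source,
+		   filesystem type and data arguments are ignored when only
+		   the propagation type is changed */
+		mount(NULL, "mountpoint", NULL, MS_SHARED, NULL);
+
+		/* recursive forms correspond to --make-rshared and friends */
+		mount(NULL, "mountpoint", NULL, MS_REC | MS_SHARED, NULL);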
4) Use cases
@@ -154,9 +166,10 @@ replicas continue to be exactly same.
Solution:
- The system administrator can make the mount at /cdrom shared
- mount --bind /cdrom /cdrom
- mount --make-shared /cdrom
+ The system administrator can make the mount at /cdrom shared::
+
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
Now any process that clones off a new namespace will have a
mount at /cdrom which is a replica of the same mount in the
@@ -172,14 +185,14 @@ replicas continue to be exactly same.
Solution:
To begin with, the administrator can mark the entire mount tree
- as shareable.
+ as shareable::
- mount --make-rshared /
+ mount --make-rshared /
A new process can clone off a new namespace. And mark some part
- of its namespace as slave
+ of its namespace as slave::
- mount --make-rslave /myprivatetree
+ mount --make-rslave /myprivatetree
	Henceforth any mounts within /myprivatetree done by the
	process will not show up in any other namespace. However, mounts
@@ -206,13 +219,13 @@ replicas continue to be exactly same.
versions of the file depending on the path used to access that
file.
- An example is:
+ An example is::
- mount --make-shared /
- mount --rbind / /view/v1
- mount --rbind / /view/v2
- mount --rbind / /view/v3
- mount --rbind / /view/v4
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
and if /usr has a versioning filesystem mounted, then that
mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
@@ -224,8 +237,8 @@ replicas continue to be exactly same.
filesystem is being requested and return the corresponding
inode.
-5) Detailed semantics:
--------------------
+5) Detailed semantics
+---------------------
The section below explains the detailed semantics of
bind, rbind, move, mount, umount and clone-namespace operations.
@@ -235,6 +248,7 @@ replicas continue to be exactly same.
5a) Mount states
A given mount can be in one of the following states
+
1) shared
2) slave
3) shared and slave
@@ -252,7 +266,8 @@ replicas continue to be exactly same.
A 'shared mount' is defined as a vfsmount that belongs to a
'peer group'.
- For example:
+ For example::
+
mount --make-shared /mnt
mount --bind /mnt /tmp
@@ -270,7 +285,7 @@ replicas continue to be exactly same.
A slave mount as the name implies has a master mount from which
mount/unmount events are received. Events do not propagate from
the slave mount to the master. Only a shared mount can be made
- a slave by executing the following command
+ a slave by executing the following command::
mount --make-slave mount
@@ -290,8 +305,10 @@ replicas continue to be exactly same.
peer group.
Only a slave vfsmount can be made as 'shared and slave' by
- either executing the following command
+ either executing the following command::
+
mount --make-shared mount
+
or by moving the slave vfsmount under a shared vfsmount.
(4) Private mount
@@ -307,30 +324,32 @@ replicas continue to be exactly same.
State diagram:
+
The state diagram below explains the state transition of a mount,
- in response to various commands.
- ------------------------------------------------------------------------
- | |make-shared | make-slave | make-private |make-unbindab|
- --------------|------------|--------------|--------------|-------------|
- |shared |shared |*slave/private| private | unbindable |
- | | | | | |
- |-------------|------------|--------------|--------------|-------------|
- |slave |shared | **slave | private | unbindable |
- | |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |shared |shared | slave | private | unbindable |
- |and slave |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |private |shared | **private | private | unbindable |
- |-------------|------------|--------------|--------------|-------------|
- |unbindable |shared |**unbindable | private | unbindable |
- ------------------------------------------------------------------------
-
- * if the shared mount is the only mount in its peer group, making it
- slave, makes it private automatically. Note that there is no master to
- which it can be slaved to.
-
- ** slaving a non-shared mount has no effect on the mount.
+ in response to various commands::
+
+ -----------------------------------------------------------------------
+ | |make-shared | make-slave | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared |shared |*slave/private| private | unbindable |
+ | | | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave |shared | **slave | private | unbindable |
+ | |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared |shared | slave | private | unbindable |
+ |and slave |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |private |shared | **private | private | unbindable |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable |shared |**unbindable | private | unbindable |
+ ------------------------------------------------------------------------
+
+	* if the shared mount is the only mount in its peer group, making it
+	  a slave makes it private automatically. Note that there is no
+	  master to which it can be slaved.
+
+ ** slaving a non-shared mount has no effect on the mount.
Apart from the commands listed below, the 'move' operation also changes
	the state of a mount depending on the type of the destination mount. Its
@@ -338,31 +357,32 @@ replicas continue to be exactly same.
5b) Bind semantics
- Consider the following command
+ Consider the following command::
- mount --bind A/a B/b
+ mount --bind A/a B/b
where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
is the destination mount and 'b' is the dentry in the destination mount.
The outcome depends on the type of mount of 'A' and 'B'. The table
- below contains quick reference.
- ---------------------------------------------------------------------------
- | BIND MOUNT OPERATION |
- |**************************************************************************
- |source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared | shared & slave | invalid |
- | | | | | |
- |non-shared| shared | private | slave | invalid |
- ***************************************************************************
+ below contains quick reference::
+
+ --------------------------------------------------------------------------
+ | BIND MOUNT OPERATION |
+ |************************************************************************|
+ |source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |************************************************************************|
+ | shared | shared | shared | shared & slave | invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | invalid |
+ **************************************************************************
Details:
- 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
@@ -371,7 +391,7 @@ replicas continue to be exactly same.
'B'. And finally the peer-group of 'C' is merged with the peer group
of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
	which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
@@ -379,7 +399,7 @@ replicas continue to be exactly same.
'C', 'C1', .., 'Cn' with exactly the same configuration as the
propagation tree for 'B'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
	mount 'C' which is a clone of 'A', is created. Its root dentry is 'a'.
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
'C3' ... are created and mounted at the dentry 'b' on all mounts where
@@ -389,19 +409,19 @@ replicas continue to be exactly same.
is made the slave of mount 'Z'. In other words, mount 'C' is in the
state 'slave and shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
+	4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
	invalid operation.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+	5. 'A' is a private mount and 'B' is a non-shared (private or slave or
unbindable) mount. A new mount 'C' which is clone of 'A', is created.
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
peer-group of 'A'.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
new mount 'C' which is a clone of 'A' is created. Its root dentry is
'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
@@ -409,7 +429,7 @@ replicas continue to be exactly same.
mount/unmount on 'A' do not propagate anywhere else. Similarly
mount/unmount on 'C' do not propagate anywhere else.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
+	8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
	invalid operation. An unbindable mount cannot be bind mounted.
5c) Rbind semantics
@@ -422,7 +442,9 @@ replicas continue to be exactly same.
then the subtree under the unbindable mount is pruned in the new
location.
- eg: let's say we have the following mount tree.
+	e.g.:
+
+	Let's say we have the following mount tree::
A
/ \
@@ -430,12 +452,12 @@ replicas continue to be exactly same.
/ \ / \
D E F G
- Let's say all the mount except the mount C in the tree are
- of a type other than unbindable.
+	Let's say all the mounts except the mount C in the tree are
+	of a type other than unbindable.
- If this tree is rbound to say Z
+ If this tree is rbound to say Z
- We will have the following tree at the new location.
+ We will have the following tree at the new location::
Z
|
@@ -457,24 +479,26 @@ replicas continue to be exactly same.
the dentry in the destination mount.
The outcome depends on the type of the mount of 'A' and 'B'. The table
- below is a quick reference.
- ---------------------------------------------------------------------------
- | MOVE MOUNT OPERATION |
- |**************************************************************************
- | source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared |shared and slave| invalid |
- | | | | | |
- |non-shared| shared | private | slave | unbindable |
- ***************************************************************************
- NOTE: moving a mount residing under a shared mount is invalid.
+ below is a quick reference::
+
+ ---------------------------------------------------------------------------
+ | MOVE MOUNT OPERATION |
+ |**************************************************************************
+ | source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared |shared and slave| invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | unbindable |
+ ***************************************************************************
+
+ .. Note:: moving a mount residing under a shared mount is invalid.
Details follow:
- 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. A new propagation tree is created in the
@@ -483,7 +507,7 @@ replicas continue to be exactly same.
propagation tree is appended to the already existing propagation tree
of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. The mount 'A' becomes a shared mount and a
@@ -491,7 +515,7 @@ replicas continue to be exactly same.
'B'. This new propagation tree contains all the new mounts 'A1',
'A2'... 'An'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
receive propagation from mount 'B'. A new propagation tree is created
@@ -501,32 +525,32 @@ replicas continue to be exactly same.
'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
becomes 'shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
+	4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
is invalid. Because mounting anything on the shared mount 'B' can
create new mounts that get mounted on the mounts that receive
propagation from 'B'. And since the mount 'A' is unbindable, cloning
it to mount at other mountpoints is not possible.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+	5. 'A' is a private mount and 'B' is a non-shared (private or slave or
unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
shared mount.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
continues to be a slave mount of mount 'Z'.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
+	8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
	'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
unbindable mount.
5e) Mount semantics
- Consider the following command
+ Consider the following command::
- mount device B/b
+ mount device B/b
'B' is the destination mount and 'b' is the dentry in the destination
mount.
@@ -537,9 +561,9 @@ replicas continue to be exactly same.
5f) Unmount semantics
- Consider the following command
+ Consider the following command::
- umount A
+ umount A
where 'A' is a mount mounted on mount 'B' at dentry 'b'.
@@ -589,13 +613,16 @@ replicas continue to be exactly same.
6) Quiz
+-------
A. What is the result of the following command sequence?
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mount --bind /mnt /tmp
- mount --move /tmp /mnt/1
+ ::
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
	What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
	Should they all be identical? Or should /mnt and /mnt/1 be
@@ -604,23 +631,27 @@ replicas continue to be exactly same.
B. What is the result of the following command sequence?
- mount --make-rshared /
- mkdir -p /v/1
- mount --rbind / /v/1
+ ::
+
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
	What should the content of /v/1/v/1 be?
C. What is the result of the following command sequence?
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mkdir -p /mnt/1/2/3 /mnt/1/test
- mount --bind /mnt/1 /tmp
- mount --make-slave /mnt
- mount --make-shared /mnt
- mount --bind /mnt/1/2 /tmp1
- mount --make-slave /mnt
+ ::
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
At this point we have the first mount at /tmp and
its root dentry is 1. Let's call this mount 'A'
@@ -644,6 +675,7 @@ replicas continue to be exactly same.
/mnt/1/test be?
7) FAQ
+------
Q1. Why is bind mount needed? How is it different from symbolic links?
symbolic links can get stale if the destination mount gets
@@ -668,7 +700,8 @@ replicas continue to be exactly same.
step 1:
let's say the root tree has just two directories with
- one vfsmount.
+ one vfsmount::
+
root
/ \
tmp usr
@@ -676,14 +709,17 @@ replicas continue to be exactly same.
And we want to replicate the tree at multiple
mountpoints under /root/tmp
- step2:
- mount --make-shared /root
+ step 2:
+ ::
- mkdir -p /tmp/m1
- mount --rbind /root /tmp/m1
+ mount --make-shared /root
- the new tree now looks like this:
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this::
root
/ \
@@ -697,11 +733,13 @@ replicas continue to be exactly same.
it has two vfsmounts
- step3:
+ step 3:
+ ::
+
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -724,6 +762,7 @@ replicas continue to be exactly same.
it has 6 vfsmounts
step 4:
+ ::
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
@@ -740,7 +779,8 @@ replicas continue to be exactly same.
step 1:
let's say the root tree has just two directories with
- one vfsmount.
+ one vfsmount::
+
root
/ \
tmp usr
@@ -748,17 +788,20 @@ replicas continue to be exactly same.
How do we set up the same tree at multiple locations under
/root/tmp
- step2:
- mount --bind /root/tmp /root/tmp
+ step 2:
+ ::
- mount --make-rshared /root
- mount --make-unbindable /root/tmp
- mkdir -p /tmp/m1
+ mount --bind /root/tmp /root/tmp
- mount --rbind /root /tmp/m1
+ mount --make-rshared /root
+ mount --make-unbindable /root/tmp
- the new tree now looks like this:
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this::
root
/ \
@@ -768,11 +811,13 @@ replicas continue to be exactly same.
/ \
tmp usr
- step3:
+ step 3:
+ ::
+
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -782,12 +827,13 @@ replicas continue to be exactly same.
/ \ / \
tmp usr tmp usr
- step4:
+ step 4:
+ ::
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -798,28 +844,35 @@ replicas continue to be exactly same.
tmp usr tmp usr tmp usr
8) Implementation
+-----------------
8A) Datastructure
- 4 new fields are introduced to struct vfsmount
- ->mnt_share
- ->mnt_slave_list
- ->mnt_slave
- ->mnt_master
+	Four new fields are introduced to struct vfsmount:
+
+ * ->mnt_share
+ * ->mnt_slave_list
+ * ->mnt_slave
+ * ->mnt_master
- ->mnt_share links together all the mount to/from which this vfsmount
+ ->mnt_share
+	links together all the mounts to/from which this vfsmount
	sends/receives propagation events.
- ->mnt_slave_list links all the mounts to which this vfsmount propagates
+ ->mnt_slave_list
+	links all the mounts to which this vfsmount propagates.
- ->mnt_slave links together all the slaves that its master vfsmount
+ ->mnt_slave
+ links together all the slaves that its master vfsmount
propagates to.
- ->mnt_master points to the master vfsmount from which this vfsmount
+ ->mnt_master
+ points to the master vfsmount from which this vfsmount
receives propagation.
- ->mnt_flags takes two more flags to indicate the propagation status of
+ ->mnt_flags
+ takes two more flags to indicate the propagation status of
the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
replicated.
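+
+	As a simplified C sketch of the fields described above (illustrative
+	only; the real structure in the kernel source has many more members)::
+
+		struct vfsmount {
+			struct list_head mnt_share;	 /* peer group links */
+			struct list_head mnt_slave_list; /* slaves of this mount */
+			struct list_head mnt_slave;	 /* entry in master's slave list */
+			struct vfsmount *mnt_master;	 /* mount this one is slaved to */
+			int mnt_flags;			 /* MNT_SHARE, MNT_UNCLONABLE, ... */
+		};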
@@ -842,7 +895,7 @@ replicas continue to be exactly same.
	An example propagation tree is shown in the figure below.
[ NOTE: Though it looks like a forest, if we consider all the shared
- mounts as a conceptual entity called 'pnode', it becomes a tree]
+ mounts as a conceptual entity called 'pnode', it becomes a tree]::
A <--> B <--> C <---> D
@@ -864,14 +917,19 @@ replicas continue to be exactly same.
A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
E's ->mnt_share links with ->mnt_share of K
- 'E', 'K', 'F', 'G' have their ->mnt_master point to struct
- vfsmount of 'A'
+
+ 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
+
'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
+
K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
+
J and K's ->mnt_master points to struct vfsmount of C
+
and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
+
'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
@@ -903,6 +961,7 @@ replicas continue to be exactly same.
Prepare phase:
for each mount in the source tree:
+
a) Create the necessary number of mount trees to
be attached to each of the mounts that receive
propagation from the destination mount.
@@ -929,11 +988,12 @@ replicas continue to be exactly same.
Abort phase
delete all the newly created trees.
- NOTE: all the propagation related functionality resides in the file
- pnode.c
+ .. Note::
+ all the propagation related functionality resides in the file pnode.c
------------------------------------------------------------------------
version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
+
version 0.2 (Incorporated comments from Al Viro)
diff --git a/Documentation/filesystems/cifs/cifsroot.txt b/Documentation/filesystems/smb/cifsroot.rst
index 947b7ec6ce9e..bf2d9db3acb9 100644
--- a/Documentation/filesystems/cifs/cifsroot.txt
+++ b/Documentation/filesystems/smb/cifsroot.rst
@@ -1,7 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
Mounting root file system via SMB (cifs.ko)
===========================================
Written 2019 by Paulo Alcantara <palcantara@suse.de>
+
Written 2019 by Aurelien Aptel <aaptel@suse.com>
The CONFIG_CIFS_ROOT option enables experimental root file system
@@ -32,7 +36,7 @@ Server configuration
====================
To enable SMB1+UNIX extensions you will need to set these global
-settings in Samba smb.conf:
+settings in Samba smb.conf::
[global]
server min protocol = NT1
@@ -41,17 +45,21 @@ settings in Samba smb.conf:
Kernel command line
===================
-root=/dev/cifs
+::
+
+ root=/dev/cifs
This is just a virtual device that basically tells the kernel to mount
the root file system via the SMB protocol.
-cifsroot=//<server-ip>/<share>[,options]
+::
+
+ cifsroot=//<server-ip>/<share>[,options]
Enables the kernel to mount the root file system via SMB from the
<server-ip> and <share> specified in this option.
-The default mount options are set in fs/cifs/cifsroot.c.
+The default mount options are set in fs/smb/client/cifsroot.c.
server-ip
IPv4 address of the server.
@@ -65,33 +73,33 @@ options
Examples
========
-Export root file system as a Samba share in smb.conf file.
+Export root file system as a Samba share in smb.conf file::
-...
-[linux]
- path = /path/to/rootfs
- read only = no
- guest ok = yes
- force user = root
- force group = root
- browseable = yes
- writeable = yes
- admin users = root
- public = yes
- create mask = 0777
- directory mask = 0777
-...
+ ...
+ [linux]
+ path = /path/to/rootfs
+ read only = no
+ guest ok = yes
+ force user = root
+ force group = root
+ browseable = yes
+ writeable = yes
+ admin users = root
+ public = yes
+ create mask = 0777
+ directory mask = 0777
+ ...
-Restart smb service.
+Restart smb service::
-# systemctl restart smb
+ # systemctl restart smb
Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and
-CONFIG_IP_PNP options enabled.
+CONFIG_IP_PNP options enabled::
-# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
- -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
- -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
+ # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
+ -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
+ -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
1: https://wiki.samba.org/index.php/UNIX_Extensions
diff --git a/Documentation/filesystems/smb/index.rst b/Documentation/filesystems/smb/index.rst
new file mode 100644
index 000000000000..1c8597a679ab
--- /dev/null
+++ b/Documentation/filesystems/smb/index.rst
@@ -0,0 +1,10 @@
+===============================
+CIFS
+===============================
+
+
+.. toctree::
+ :maxdepth: 1
+
+ ksmbd
+ cifsroot
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
new file mode 100644
index 000000000000..6b30e43a0d11
--- /dev/null
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+KSMBD - SMB3 Kernel Server
+==========================
+
+KSMBD is a Linux kernel server which implements the SMB3 protocol in kernel
+space for sharing files over the network.
+
+KSMBD architecture
+==================
+
+The subset of operations that are performance-critical belongs in kernel
+space; the operations that are not really performance-related belong in
+user space. So DCE/RPC management, which has historically been the source
+of a number of buffer overflow issues and dangerous security bugs, and
+user account management are implemented in user space as ksmbd.mountd.
+File operations that are related to performance (open/read/write/close
+etc.) are handled in kernel space (ksmbd). This also allows for easier
+integration with the VFS interface for all file operations.
+
+ksmbd (kernel daemon)
+---------------------
+
+When the server daemon is started, it starts up a forker thread
+(ksmbd/interface name) at initialization time and opens a dedicated port,
+445, for listening to SMB requests. Whenever a new client makes a request,
+the forker thread accepts the client connection and forks a new thread to
+serve as a dedicated communication channel between the client and the
+server. This allows for parallel processing of SMB requests (commands)
+from clients as well as allowing new clients to make new connections.
+Each instance is named ksmbd/1~n (port number) to indicate a connected
+client. Depending on the SMB request type, each new thread can decide to
+pass the command through to user space (ksmbd.mountd); currently the
+DCE/RPC commands are identified to be handled through user space. To
+further utilize the Linux kernel, the commands are processed as work items
+executed in the handlers of the ksmbd-io kworker threads. This allows the
+handlers to be multiplexed, as the kernel takes care of initiating extra
+worker threads if the load increases and, vice versa, destroys the extra
+worker threads if the load decreases. So, after a connection is
+established with a client, the dedicated ksmbd/1..n (port number) thread
+takes complete ownership of receiving and parsing SMB commands. Each
+received command is worked in parallel, i.e. there can be commands from
+multiple clients being worked in parallel. After a command is received, a
+separate kernel work item is prepared for it, which is then queued to be
+handled by the ksmbd-io kworkers. Queueing each SMB work item to the
+kworkers lets the kernel manage load sharing optimally and optimizes
+client performance by handling client commands in parallel.
+
+ksmbd.mountd (user space daemon)
+--------------------------------
+
+ksmbd.mountd is a userspace process which transfers the user accounts and
+passwords that are registered using ksmbd.adduser (part of the user-space
+utilities). Further, it passes the share configuration parameters parsed
+from smb.conf to ksmbd in the kernel. For the execution part it has a
+daemon which runs continuously and is connected to the kernel interface
+using a netlink socket; it waits for requests (DCE/RPC and share/user
+info). It handles the RPC calls (at a minimum a few dozen) that are most
+important for a file server, such as NetShareEnum and NetServerGetInfo.
+The complete DCE/RPC response is prepared in user space and passed over to
+the associated kernel thread for the client.
+
+
+KSMBD Feature Status
+====================
+
+============================== =================================================
+Feature name Status
+============================== =================================================
+Dialects                       Supported. SMB2.1, SMB3.0 and SMB3.1.1
+                               dialects (intentionally excludes the
+                               security-vulnerable SMB1 dialect).
+Auto Negotiation Supported.
+Compound Request Supported.
+Oplock Cache Mechanism Supported.
+SMB2 leases (v1 lease)         Supported.
+Directory leases (v2 lease)    Supported.
+Multi-credits Supported.
+NTLM/NTLMv2 Supported.
+HMAC-SHA256 Signing Supported.
+Secure negotiate Supported.
+Signing Update Supported.
+Pre-authentication integrity Supported.
+SMB3 encryption (CCM, GCM)     Supported. (CCM/GCM128 and CCM/GCM256
+                               supported)
+SMB direct (RDMA)              Supported.
+SMB3 Multi-channel             Partially Supported. Planned to implement
+                               replay/retry mechanisms in the future.
+Receive Side Scaling mode Supported.
+SMB3.1.1 POSIX extension Supported.
+ACLs                           Partially Supported. Only DACLs are
+                               available; SACLs (auditing) are planned for
+                               the future. For ownership (SIDs) ksmbd
+                               generates random subauth values (then stores
+                               them to disk) and uses the uid/gid taken from
+                               the inode as the RID for the local domain SID.
+                               The current ACL implementation is limited to
+                               a standalone server, not a domain member.
+                               Integration with Samba tools is being worked
+                               on to allow future support for running as a
+                               domain member.
+Kerberos Supported.
+Durable handle v1,v2 Planned for future.
+Persistent handle Planned for future.
+SMB2 notify Planned for future.
+Sparse file support Supported.
+DCE/RPC support                Partially Supported. A few calls
+                               (NetShareEnumAll, NetServerGetInfo, SAMR,
+                               LSARPC) that are needed for a file server are
+                               handled via the netlink interface from
+                               ksmbd.mountd. Additional integration with
+                               Samba tools and libraries via upcall is being
+                               investigated to allow support for additional
+                               DCE/RPC management calls (and, in the future,
+                               support for e.g. the Witness protocol).
+ksmbd/nfsd interoperability    Planned for future. The features that ksmbd
+                               supports are Leases, Notify, ACLs and Share
+                               modes.
+SMB3.1.1 Compression Planned for future.
+SMB3.1.1 over QUIC Planned for future.
+Signing/Encryption over RDMA Planned for future.
+SMB3.1.1 GMAC signing support Planned for future.
+============================== =================================================
+
+
+How to run
+==========
+
+1. Download ksmbd-tools (https://github.com/cifsd-team/ksmbd-tools/releases)
+   and compile them.
+
+   - Refer to the README (https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
+     for how to use the ksmbd.mountd/adduser/addshare/control utils
+
+     ::
+
+       $ ./autogen.sh
+       $ ./configure --with-rundir=/run
+       $ make && sudo make install
+
+2. Create the /usr/local/etc/ksmbd/ksmbd.conf file and add an SMB share to it.
+
+   - Refer to ksmbd.conf.example in ksmbd-tools; see the ksmbd.conf manpage
+     for details on configuring shares.
+
+     ::
+
+       $ man ksmbd.conf
+
+3. Create a user/password for the SMB share.
+
+   - See the ksmbd.adduser manpage.
+
+     ::
+
+       $ man ksmbd.adduser
+       $ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access>
+
+4. Insert the ksmbd.ko module after building your kernel. There is no need
+   to load the module if ksmbd is built into the kernel.
+
+   - Enable ksmbd in menuconfig (e.g. $ make menuconfig)::
+
+       [*] Network File Systems --->
+           <M> SMB3 server support (EXPERIMENTAL)
+
+   ::
+
+     $ sudo modprobe ksmbd
+
+5. Start the ksmbd user-space daemon::
+
+     $ sudo ksmbd.mountd
+
+6. Access the share from Windows or Linux using an SMB3 client (cifs.ko or
+   Samba's smbclient).
+
+Shutdown KSMBD
+==============
+
+1. Kill the user-space and kernel-space daemons::
+
+     # sudo ksmbd.control -s
+
+How to turn debug print on
+==========================
+
+Debug prints can be enabled per component through
+/sys/class/ksmbd-control/debug.
+
+1. Enable all component prints::
+
+     # sudo ksmbd.control -d "all"
+
+2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma)::
+
+     # sudo ksmbd.control -d "smb"
+
+3. Show which prints are enabled::
+
+     # cat /sys/class/ksmbd-control/debug
+     [smb] auth vfs oplock ipc conn [rdma]
+
+4. Disable prints:
+
+   If you select the same component once more, it is disabled and shown
+   without brackets.
diff --git a/Documentation/filesystems/spufs/index.rst b/Documentation/filesystems/spufs/index.rst
new file mode 100644
index 000000000000..5ed4a8494967
--- /dev/null
+++ b/Documentation/filesystems/spufs/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+SPU Filesystem
+==============
+
+
+.. toctree::
+ :maxdepth: 1
+
+ spufs
+ spu_create
+ spu_run
diff --git a/Documentation/filesystems/spufs/spu_create.rst b/Documentation/filesystems/spufs/spu_create.rst
new file mode 100644
index 000000000000..83108c099696
--- /dev/null
+++ b/Documentation/filesystems/spufs/spu_create.rst
@@ -0,0 +1,131 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+spu_create
+==========
+
+Name
+====
+ spu_create - create a new spu context
+
+
+Synopsis
+========
+
+ ::
+
+ #include <sys/types.h>
+ #include <sys/spu.h>
+
+ int spu_create(const char *pathname, int flags, mode_t mode);
+
+Description
+===========
+ The spu_create system call is used on PowerPC machines that implement
+ the Cell Broadband Engine Architecture in order to access Synergistic
+ Processor Units (SPUs). It creates a new logical context for an SPU in
+   pathname and returns a handle associated with it. pathname must
+ point to a non-existing directory in the mount point of the SPU file
+   system (spufs). When spu_create is successful, a directory gets
+   created on pathname and it is populated with files.
+
+ The returned file handle can only be passed to spu_run(2) or closed,
+   other operations are not defined on it. When it is closed, all
+   associated directory entries in spufs are removed.
+ pointing either inside of the context directory or to this file
+ descriptor is closed, the logical SPU context is destroyed.
+
+ The parameter flags can be zero or any bitwise or'd combination of the
+ following constants:
+
+ SPU_RAWIO
+ Allow mapping of some of the hardware registers of the SPU into
+ user space. This flag requires the CAP_SYS_RAWIO capability, see
+ capabilities(7).
+
+ The mode parameter specifies the permissions used for creating the new
+ directory in spufs. mode is modified with the user's umask(2) value
+ and then used for both the directory and the files contained in it. The
+ file permissions mask out some more bits of mode because they typically
+ support only read or write access. See stat(2) for a full list of the
+ possible mode values.
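+
+   A minimal usage sketch (assuming spufs is mounted at /spu; error
+   handling abbreviated)::
+
+      #include <stdio.h>
+      #include <sys/types.h>
+      #include <sys/spu.h>
+
+      int ctx = spu_create("/spu/my_context", 0, 0755);
+      if (ctx == -1)
+              perror("spu_create");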
+
+
+Return Value
+============
+ spu_create returns a new file descriptor. It may return -1 to indicate
+ an error condition and set errno to one of the error codes listed
+ below.
+
+
+Errors
+======
+ EACCES
+ The current user does not have write access on the spufs mount
+ point.
+
+ EEXIST An SPU context already exists at the given path name.
+
+ EFAULT pathname is not a valid string pointer in the current address
+ space.
+
+ EINVAL pathname is not a directory in the spufs mount point.
+
+ ELOOP Too many symlinks were found while resolving pathname.
+
+ EMFILE The process has reached its maximum open file limit.
+
+ ENAMETOOLONG
+ pathname was too long.
+
+ ENFILE The system has reached the global open file limit.
+
+ ENOENT Part of pathname could not be resolved.
+
+ ENOMEM The kernel could not allocate all resources required.
+
+ ENOSPC There are not enough SPU resources available to create a new
+          context or the user-specific limit for the number of SPU
+          contexts has been reached.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+ ENOTDIR
+ A part of pathname is not a directory.
+
+
+
+Notes
+=====
+ spu_create is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+   See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the
+   recommended libraries.
+
+
+Files
+=====
+ pathname must point to a location beneath the mount point of spufs. By
+ convention, it gets mounted in /spu.
+
+
+Conforming to
+=============
+   This call is Linux-specific and only implemented by the ppc64
+   architecture. Programs using this system call are not portable.
+
+
+Bugs
+====
+   The code does not yet fully implement all features outlined here.
+
+
+Author
+======
+ Arnd Bergmann <arndb@de.ibm.com>
+
+See Also
+========
+ capabilities(7), close(2), spu_run(2), spufs(7)
diff --git a/Documentation/filesystems/spufs/spu_run.rst b/Documentation/filesystems/spufs/spu_run.rst
new file mode 100644
index 000000000000..7fdb1c31cb91
--- /dev/null
+++ b/Documentation/filesystems/spufs/spu_run.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+spu_run
+=======
+
+
+Name
+====
+ spu_run - execute an spu context
+
+
+Synopsis
+========
+
+ ::
+
+ #include <sys/spu.h>
+
+ int spu_run(int fd, unsigned int *npc, unsigned int *event);
+
+Description
+===========
+ The spu_run system call is used on PowerPC machines that implement the
+   Cell Broadband Engine Architecture in order to access Synergistic
+   Processor Units (SPUs). It uses the fd that was returned from
+   spu_create(2) to address a specific SPU context. When the context gets
+   scheduled to a physical SPU, it starts execution at the instruction pointer
+ passed in npc.
+
+ Execution of SPU code happens synchronously, meaning that spu_run does
+   not return while the SPU is still running. If there is a need to
+   execute SPU code in parallel with other code on either the main CPU or
+ other SPUs, you need to create a new thread of execution first, e.g.
+ using the pthread_create(3) call.
+
+ When spu_run returns, the current value of the SPU instruction pointer
+ is written back to npc, so you can call spu_run again without updating
+ the pointers.
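+
+   A minimal calling sketch (fd as returned from spu_create(2); the
+   status bits tested here are described under Return Value below; error
+   handling abbreviated)::
+
+      #include <stdio.h>
+      #include <sys/spu.h>
+
+      unsigned int npc = 0;
+      int status;
+
+      status = spu_run(fd, &npc, NULL);
+      if (status == -1)
+              perror("spu_run");
+      else if (status & 0x02)        /* stopped by stop-and-signal */
+              printf("stop code 0x%x\n", (status & 0x3fff0000) >> 16);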
+
+ event can be a NULL pointer or point to an extended status code that
+   gets filled when spu_run returns. It can be one of the following
+   constants:
+
+ SPE_EVENT_DMA_ALIGNMENT
+ A DMA alignment error
+
+ SPE_EVENT_SPE_DATA_SEGMENT
+ A DMA segmentation error
+
+ SPE_EVENT_SPE_DATA_STORAGE
+ A DMA storage error
+
+ If NULL is passed as the event argument, these errors will result in a
+ signal delivered to the calling process.
+
+Return Value
+============
+ spu_run returns the value of the spu_status register or -1 to indicate
+ an error and set errno to one of the error codes listed below. The
+ spu_status register value contains a bit mask of status codes and
+ optionally a 14 bit code returned from the stop-and-signal instruction
+ on the SPU. The bit masks for the status codes are:
+
+ 0x02
+ SPU was stopped by stop-and-signal.
+
+ 0x04
+ SPU was stopped by halt.
+
+ 0x08
+ SPU is waiting for a channel.
+
+ 0x10
+ SPU is in single-step mode.
+
+ 0x20
+ SPU has tried to execute an invalid instruction.
+
+ 0x40
+ SPU has tried to access an invalid channel.
+
+ 0x3fff0000
+ The bits masked with this value contain the code returned from
+ stop-and-signal.
+
+   Either one or more of the lower eight bits are set, or an error code
+   is returned from spu_run.
+
+Errors
+======
+ EAGAIN or EWOULDBLOCK
+ fd is in non-blocking mode and spu_run would block.
+
+ EBADF fd is not a valid file descriptor.
+
+ EFAULT npc is not a valid pointer or status is neither NULL nor a valid
+ pointer.
+
+ EINTR A signal occurred while spu_run was in progress. The npc value
+ has been updated to the new program counter value if necessary.
+
+ EINVAL fd is not a file descriptor returned from spu_create(2).
+
+   ENOMEM Insufficient memory was available to handle a page fault
+          resulting from an MFC direct memory access.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+
+Notes
+=====
+ spu_run is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+   See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the
+   recommended libraries.
+
+
+Conforming to
+=============
+   This call is Linux-specific and only implemented by the ppc64
+   architecture. Programs using this system call are not portable.
+
+
+Bugs
+====
+   The code does not yet fully implement all features outlined here.
+
+
+Author
+======
+ Arnd Bergmann <arndb@de.ibm.com>
+
+See Also
+========
+ capabilities(7), close(2), spu_create(2), spufs(7)
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs/spufs.rst
index eb9e3aa63026..ca0441cbe37e 100644
--- a/Documentation/filesystems/spufs.txt
+++ b/Documentation/filesystems/spufs/spufs.rst
@@ -1,12 +1,18 @@
-SPUFS(2) Linux Programmer's Manual SPUFS(2)
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+spufs
+=====
+
+Name
+====
-NAME
spufs - the SPU file system
-DESCRIPTION
+Description
+===========
+
The SPU file system is used on PowerPC machines that implement the Cell
Broadband Engine Architecture in order to access Synergistic Processor
Units (SPUs).
@@ -21,7 +27,9 @@ DESCRIPTION
ally add or remove files.
-MOUNT OPTIONS
+Mount Options
+=============
+
uid=<uid>
set the user owning the mount point, the default is 0 (root).
@@ -29,7 +37,9 @@ MOUNT OPTIONS
set the group owning the mount point, the default is 0 (root).
-FILES
+Files
+=====
+
   The files in spufs mostly follow the standard behavior for regular
   system calls like read(2) or write(2), but often support only a subset of
the operations supported on regular file systems. This list details the
@@ -125,14 +135,12 @@ FILES
space is available for writing.
- /mbox_stat
- /ibox_stat
- /wbox_stat
+ /mbox_stat, /ibox_stat, /wbox_stat
Read-only files that contain the length of the current queue, i.e. how
many words can be read from mbox or ibox or how many words can be
written to wbox without blocking. The files can be read only in 4-byte
units and return a big-endian binary integer number. The possible
- operations on an open *box_stat file are:
+ operations on an open ``*box_stat`` file are:
read(2)
If a count smaller than four is requested, read returns -1 and
@@ -143,12 +151,7 @@ FILES
in EAGAIN.
- /npc
- /decr
- /decr_status
- /spu_tag_mask
- /event_mask
- /srr0
+ /npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0
Internal registers of the SPU. The representation is an ASCII string
with the numeric value of the next instruction to be executed. These
can be used in read/write mode for debugging, but normal operation of
@@ -157,17 +160,14 @@ FILES
The contents of these files are:
+ =================== ===================================
npc Next Program Counter
-
decr SPU Decrementer
-
decr_status Decrementer Status
-
spu_tag_mask MFC tag mask for SPU DMA
-
event_mask Event mask for SPU interrupts
-
srr0 Interrupt Return address register
+ =================== ===================================
The possible operations on an open npc, decr, decr_status,
@@ -206,8 +206,7 @@ FILES
from the data buffer, updating the value of the fpcr register.
- /signal1
- /signal2
+ /signal1, /signal2
The two signal notification channels of an SPU. These are read-write
files that operate on a 32 bit word. Writing to one of these files
triggers an interrupt on the SPU. The value written to the signal
@@ -228,13 +227,12 @@ FILES
from the data buffer, updating the value of the specified signal
notification register. The signal notification register will
either be replaced with the input data or will be updated to the
- bitwise OR or the old value and the input data, depending on the
+ bitwise OR of the old value and the input data, depending on the
contents of the signal1_type, or signal2_type respectively,
file.
- /signal1_type
- /signal2_type
+ /signal1_type, /signal2_type
These two files change the behavior of the signal1 and signal2 notifi-
cation files. The contain a numerical ASCII string which is read as
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
@@ -259,263 +257,17 @@ FILES
the previous setting.
-EXAMPLES
+Examples
+========
   /etc/fstab entry::

        none /spu spufs gid=spu 0 0
-AUTHORS
+Authors
+=======
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
-SEE ALSO
+See Also
+========
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPUFS(2)
-
-------------------------------------------------------------------------------
-
-SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
-
-
-
-NAME
- spu_run - execute an spu context
-
-
-SYNOPSIS
- #include <sys/spu.h>
-
- int spu_run(int fd, unsigned int *npc, unsigned int *event);
-
-DESCRIPTION
- The spu_run system call is used on PowerPC machines that implement the
- Cell Broadband Engine Architecture in order to access Synergistic Pro-
- cessor Units (SPUs). It uses the fd that was returned from spu_cre-
- ate(2) to address a specific SPU context. When the context gets sched-
- uled to a physical SPU, it starts execution at the instruction pointer
- passed in npc.
-
- Execution of SPU code happens synchronously, meaning that spu_run does
- not return while the SPU is still running. If there is a need to exe-
- cute SPU code in parallel with other code on either the main CPU or
- other SPUs, you need to create a new thread of execution first, e.g.
- using the pthread_create(3) call.
-
- When spu_run returns, the current value of the SPU instruction pointer
- is written back to npc, so you can call spu_run again without updating
- the pointers.
-
- event can be a NULL pointer or point to an extended status code that
- gets filled when spu_run returns. It can be one of the following con-
- stants:
-
- SPE_EVENT_DMA_ALIGNMENT
- A DMA alignment error
-
- SPE_EVENT_SPE_DATA_SEGMENT
- A DMA segmentation error
-
- SPE_EVENT_SPE_DATA_STORAGE
- A DMA storage error
-
- If NULL is passed as the event argument, these errors will result in a
- signal delivered to the calling process.
-
-RETURN VALUE
- spu_run returns the value of the spu_status register or -1 to indicate
- an error and set errno to one of the error codes listed below. The
- spu_status register value contains a bit mask of status codes and
- optionally a 14 bit code returned from the stop-and-signal instruction
- on the SPU. The bit masks for the status codes are:
-
- 0x02 SPU was stopped by stop-and-signal.
-
- 0x04 SPU was stopped by halt.
-
- 0x08 SPU is waiting for a channel.
-
- 0x10 SPU is in single-step mode.
-
- 0x20 SPU has tried to execute an invalid instruction.
-
- 0x40 SPU has tried to access an invalid channel.
-
- 0x3fff0000
- The bits masked with this value contain the code returned from
- stop-and-signal.
-
- There are always one or more of the lower eight bits set or an error
- code is returned from spu_run.
-
-ERRORS
- EAGAIN or EWOULDBLOCK
- fd is in non-blocking mode and spu_run would block.
-
- EBADF fd is not a valid file descriptor.
-
- EFAULT npc is not a valid pointer or status is neither NULL nor a valid
- pointer.
-
- EINTR A signal occurred while spu_run was in progress. The npc value
- has been updated to the new program counter value if necessary.
-
- EINVAL fd is not a file descriptor returned from spu_create(2).
-
- ENOMEM Insufficient memory was available to handle a page fault result-
- ing from an MFC direct memory access.
-
- ENOSYS the functionality is not provided by the current system, because
- either the hardware does not provide SPUs or the spufs module is
- not loaded.
-
-
-NOTES
- spu_run is meant to be used from libraries that implement a more
- abstract interface to SPUs, not to be used from regular applications.
- See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
- ommended libraries.
-
-
-CONFORMING TO
- This call is Linux specific and only implemented by the ppc64 architec-
- ture. Programs using this system call are not portable.
-
-
-BUGS
- The code does not yet fully implement all features lined out here.
-
-
-AUTHOR
- Arnd Bergmann <arndb@de.ibm.com>
-
-SEE ALSO
- capabilities(7), close(2), spu_create(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPU_RUN(2)
-
-------------------------------------------------------------------------------
-
-SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
-
-
-
-NAME
- spu_create - create a new spu context
-
-
-SYNOPSIS
- #include <sys/types.h>
- #include <sys/spu.h>
-
- int spu_create(const char *pathname, int flags, mode_t mode);
-
-DESCRIPTION
- The spu_create system call is used on PowerPC machines that implement
- the Cell Broadband Engine Architecture in order to access Synergistic
- Processor Units (SPUs). It creates a new logical context for an SPU in
- pathname and returns a handle to associated with it. pathname must
- point to a non-existing directory in the mount point of the SPU file
- system (spufs). When spu_create is successful, a directory gets cre-
- ated on pathname and it is populated with files.
-
- The returned file handle can only be passed to spu_run(2) or closed,
- other operations are not defined on it. When it is closed, all associ-
- ated directory entries in spufs are removed. When the last file handle
- pointing either inside of the context directory or to this file
- descriptor is closed, the logical SPU context is destroyed.
-
- The parameter flags can be zero or any bitwise or'd combination of the
- following constants:
-
- SPU_RAWIO
- Allow mapping of some of the hardware registers of the SPU into
- user space. This flag requires the CAP_SYS_RAWIO capability, see
- capabilities(7).
-
- The mode parameter specifies the permissions used for creating the new
- directory in spufs. mode is modified with the user's umask(2) value
- and then used for both the directory and the files contained in it. The
- file permissions mask out some more bits of mode because they typically
- support only read or write access. See stat(2) for a full list of the
- possible mode values.
-
-
-RETURN VALUE
- spu_create returns a new file descriptor. It may return -1 to indicate
- an error condition and set errno to one of the error codes listed
- below.
-
-
-ERRORS
- EACCES
- The current user does not have write access on the spufs mount
- point.
-
- EEXIST An SPU context already exists at the given path name.
-
- EFAULT pathname is not a valid string pointer in the current address
- space.
-
- EINVAL pathname is not a directory in the spufs mount point.
-
- ELOOP Too many symlinks were found while resolving pathname.
-
- EMFILE The process has reached its maximum open file limit.
-
- ENAMETOOLONG
- pathname was too long.
-
- ENFILE The system has reached the global open file limit.
-
- ENOENT Part of pathname could not be resolved.
-
- ENOMEM The kernel could not allocate all resources required.
-
- ENOSPC There are not enough SPU resources available to create a new
- context or the user specific limit for the number of SPU con-
- texts has been reached.
-
- ENOSYS the functionality is not provided by the current system, because
- either the hardware does not provide SPUs or the spufs module is
- not loaded.
-
- ENOTDIR
- A part of pathname is not a directory.
-
-
-
-NOTES
- spu_create is meant to be used from libraries that implement a more
- abstract interface to SPUs, not to be used from regular applications.
- See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
- ommended libraries.
-
-
-FILES
- pathname must point to a location beneath the mount point of spufs. By
- convention, it gets mounted in /spu.
-
-
-CONFORMING TO
- This call is Linux specific and only implemented by the ppc64 architec-
- ture. Programs using this system call are not portable.
-
-
-BUGS
- The code does not yet fully implement all features lined out here.
-
-
-AUTHOR
- Arnd Bergmann <arndb@de.ibm.com>
-
-SEE ALSO
- capabilities(7), close(2), spu_run(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPU_CREATE(2)
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index df42106bae71..4af8d6207509 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -64,6 +64,66 @@ obtained from this site also.
The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+2.1 Mount options
+-----------------
+=================== =========================================================
+errors=%s Specify whether squashfs errors trigger a kernel panic
+ or not
+
+ ========== =============================================
+ continue errors don't trigger a panic (default)
+ panic trigger a panic when errors are encountered,
+ similar to several other filesystems (e.g.
+ btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)
+
+ This allows a kernel dump to be saved,
+ useful for analyzing and debugging the
+ corruption.
+ ========== =============================================
+threads=%s Select the decompression mode or the number of threads
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:
+
+ ========== =============================================
+ single use single-threaded decompression (default)
+
+ Only one block (data or metadata) can be
+ decompressed at any one time. This limits
+ CPU and memory usage to a minimum, but it
+ also gives poor performance on parallel I/O
+ workloads when using multiple CPU machines
+ due to waiting on decompressor availability.
+ multi use up to two parallel decompressors per core
+
+ If you have a parallel I/O workload and your
+ system has enough memory, using this option
+ may improve overall I/O performance. It
+ dynamically allocates decompressors on a
+ demand basis.
+ percpu use a maximum of one decompressor per core
+
+ It uses percpu variables to ensure
+ decompression is load-balanced across the
+ cores.
+ 1|2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+ If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
+ SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
+ both set:
+
+ ========== =============================================
+ 2|3|... configure the number of threads used for
+ decompression
+
+ The upper limit is num_online_cpus() * 2.
+ ========== =============================================
+
+=================== =========================================================
+
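+For example, assuming a kernel built with SQUASHFS_CHOICE_DECOMP_BY_MOUNT,
+'mount -t squashfs -o errors=panic,threads=4 image.squashfs /mnt' (the
+image and mount point paths are only illustrative) mounts an image that
+panics the kernel on corruption and decompresses with four threads.
+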
3. Squashfs Filesystem Design
-----------------------------
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
deleted file mode 100644
index 06f1d64c6f70..000000000000
--- a/Documentation/filesystems/sysfs-pci.txt
+++ /dev/null
@@ -1,131 +0,0 @@
-Accessing PCI device resources through sysfs
---------------------------------------------
-
-sysfs, usually mounted at /sys, provides access to PCI resources on platforms
-that support it. For example, a given bus might look like this:
-
- /sys/devices/pci0000:17
- |-- 0000:17:00.0
- | |-- class
- | |-- config
- | |-- device
- | |-- enable
- | |-- irq
- | |-- local_cpus
- | |-- remove
- | |-- resource
- | |-- resource0
- | |-- resource1
- | |-- resource2
- | |-- revision
- | |-- rom
- | |-- subsystem_device
- | |-- subsystem_vendor
- | `-- vendor
- `-- ...
-
-The topmost element describes the PCI domain and bus number. In this case,
-the domain number is 0000 and the bus number is 17 (both values are in hex).
-This bus contains a single function device in slot 0. The domain and bus
-numbers are reproduced for convenience. Under the device directory are several
-files, each with their own function.
-
- file function
- ---- --------
- class PCI class (ascii, ro)
- config PCI config space (binary, rw)
- device PCI device (ascii, ro)
- enable Whether the device is enabled (ascii, rw)
- irq IRQ number (ascii, ro)
- local_cpus nearby CPU mask (cpumask, ro)
- remove remove device from kernel's list (ascii, wo)
- resource PCI resource host addresses (ascii, ro)
- resource0..N PCI resource N, if present (binary, mmap, rw[1])
- resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
- revision PCI revision (ascii, ro)
- rom PCI ROM resource, if present (binary, ro)
- subsystem_device PCI subsystem device (ascii, ro)
- subsystem_vendor PCI subsystem vendor (ascii, ro)
- vendor PCI vendor (ascii, ro)
-
- ro - read only file
- rw - file is readable and writable
- wo - write only file
- mmap - file is mmapable
- ascii - file contains ascii text
- binary - file contains binary data
- cpumask - file contains a cpumask type
-
-[1] rw for RESOURCE_IO (I/O port) regions only
-
-The read only files are informational, writes to them will be ignored, with
-the exception of the 'rom' file. Writable files can be used to perform
-actions on the device (e.g. changing config space, detaching a device).
-mmapable files are available via an mmap of the file at offset 0 and can be
-used to do actual device programming from userspace. Note that some platforms
-don't support mmapping of certain resources, so be sure to check the return
-value from any attempted mmap. The most notable of these are I/O port
-resources, which also provide read/write access.
-
-The 'enable' file provides a counter that indicates how many times the device
-has been enabled. If the 'enable' file currently returns '4', and a '1' is
-echoed into it, it will then return '5'. Echoing a '0' into it will decrease
-the count. Even when it returns to 0, though, some of the initialisation
-may not be reversed.
-
-The 'rom' file is special in that it provides read-only access to the device's
-ROM file, if available. It's disabled by default, however, so applications
-should write the string "1" to the file to enable it before attempting a read
-call, and disable it following the access by writing "0" to the file. Note
-that the device must be enabled for a rom read to return data successfully.
-In the event a driver is not bound to the device, it can be enabled using the
-'enable' file, documented above.
-
-The 'remove' file is used to remove the PCI device, by writing a non-zero
-integer to the file. This does not involve any kind of hot-plug functionality,
-e.g. powering off the device. The device is removed from the kernel's list of
-PCI devices, the sysfs directory for it is removed, and the device will be
-removed from any drivers attached to it. Removal of PCI root buses is
-disallowed.
-
-Accessing legacy resources through sysfs
-----------------------------------------
-
-Legacy I/O port and ISA memory resources are also provided in sysfs if the
-underlying platform supports them. They're located in the PCI class hierarchy,
-e.g.
-
- /sys/class/pci_bus/0000:17/
- |-- bridge -> ../../../devices/pci0000:17
- |-- cpuaffinity
- |-- legacy_io
- `-- legacy_mem
-
-The legacy_io file is a read/write file that can be used by applications to
-do legacy port I/O. The application should open the file, seek to the desired
-port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
-file should be mmapped with an offset corresponding to the memory offset
-desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
-simply dereference the returned pointer (after checking for errors of course)
-to access legacy memory space.
-
-Supporting PCI access on new platforms
---------------------------------------
-
-In order to support PCI resource mapping as described above, Linux platform
-code should ideally define ARCH_GENERIC_PCI_MMAP_RESOURCE and use the generic
-implementation of that functionality. To support the historical interface of
-mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP.
-
-Alternatively, platforms which set HAVE_PCI_MMAP may provide their own
-implementation of pci_mmap_page_range() instead of defining
-ARCH_GENERIC_PCI_MMAP_RESOURCE.
-
-Platforms which support write-combining maps of PCI resources must define
-arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
-write-combining is permitted. Platforms which support maps of I/O resources
-define arch_can_pci_mmap_io() similarly.
-
-Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
-wishing to support legacy functionality should define it and provide
-pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
deleted file mode 100644
index c7c8e6438958..000000000000
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ /dev/null
@@ -1,42 +0,0 @@
-Sysfs tagging
--------------
-
-(Taken almost verbatim from Eric Biederman's netns tagging patch
-commit msg)
-
-The problem. Network devices show up in sysfs and with the network
-namespace active multiple devices with the same name can show up in
-the same directory, ouch!
-
-To avoid that problem and allow existing applications in network
-namespaces to see the same interface that is currently presented in
-sysfs, sysfs now has tagging directory support.
-
-By using the network namespace pointers as tags to separate out the
-the sysfs directory entries we ensure that we don't have conflicts
-in the directories and applications only see a limited set of
-the network devices.
-
-Each sysfs directory entry may be tagged with a namespace via the
-void *ns member of its kernfs_node. If a directory entry is tagged,
-then kernfs_node->flags will have a flag between KOBJ_NS_TYPE_NONE
-and KOBJ_NS_TYPES, and ns will point to the namespace to which it
-belongs.
-
-Each sysfs superblock's kernfs_super_info contains an array void
-*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace
-kobj_nstype first mounts sysfs, a new superblock is created. It
-will be differentiated from other sysfs mounts by having its
-s_fs_info->ns[kobj_nstype] set to the new namespace. Note that
-through bind mounting and mounts propagation, a task can easily view
-the contents of other namespaces' sysfs mounts. Therefore, when a
-namespace exits, it will call kobj_ns_exit() to invalidate any
-kernfs_node->ns pointers pointing to it.
-
-Users of this interface:
-- define a type in the kobj_ns_type enumeration.
-- call kobj_ns_type_register() with its kobj_ns_type_operations which has
- - current_ns() which returns current's namespace
- - netlink_ns() which returns a socket's namespace
- - initial_ns() which returns the initial namesapce
-- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
index 290891c3fecb..c32993bc83c7 100644
--- a/Documentation/filesystems/sysfs.rst
+++ b/Documentation/filesystems/sysfs.rst
@@ -12,15 +12,15 @@ Mike Murphy <mamurph@cs.clemson.edu>
:Original: 10 January 2003
-What it is:
-~~~~~~~~~~~
+What it is
+~~~~~~~~~~
-sysfs is a ram-based filesystem initially based on ramfs. It provides
+sysfs is a RAM-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
sysfs is tied inherently to the kobject infrastructure. Please read
-Documentation/kobject.txt for more information concerning the kobject
+Documentation/core-api/kobject.rst for more information concerning the kobject
interface.
@@ -43,7 +43,7 @@ userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
-Sysfs internally stores a pointer to the kobject that implements a
+sysfs internally stores a pointer to the kobject that implements a
directory in the kernfs_node object associated with the directory. In
the past this kobject pointer has been used by sysfs to do reference
counting directly on the kobject whenever the file is opened or closed.
@@ -55,7 +55,7 @@ Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
-the filesystem. Sysfs forwards file I/O operations to methods defined
+the filesystem. sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
@@ -72,8 +72,8 @@ you publicly humiliated and your code rewritten without notice.
An attribute definition is simply::
struct attribute {
- char * name;
- struct module *owner;
+ char *name;
+ struct module *owner;
umode_t mode;
};
@@ -138,7 +138,7 @@ __ATTR_WO(name):
assumes a name_store only and is restricted to mode
0200 that is root write access only.
__ATTR_RO_MODE(name, mode):
- fore more restrictive RO access currently
+ for more restrictive RO access; currently
only use case is the EFI System Resource Table
(see drivers/firmware/efi/esrt.c)
__ATTR_RW(name):
@@ -172,14 +172,13 @@ calls the associated methods.
To illustrate::
- #define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct device_attribute *dev_attr = to_dev_attr(attr);
- struct device *dev = to_dev(kobj);
+ struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
if (dev_attr->show)
@@ -208,7 +207,7 @@ IOW, they should take only an object, an attribute, and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
-method. Sysfs will call the method exactly once for each read or
+method. sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
@@ -222,7 +221,7 @@ implementations:
be called again, rearmed, to fill the buffer.
- On write(2), sysfs expects the entire buffer to be passed during the
- first write. Sysfs then passes the entire buffer to the store() method.
+ first write. sysfs then passes the entire buffer to the store() method.
A terminating null is added after the data on stores. This makes
functions like sysfs_streq() safe to use.
@@ -238,16 +237,14 @@ Other notes:
- Writing causes the show() method to be rearmed regardless of current
file position.
-- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+- The buffer will always be PAGE_SIZE bytes in length. On x86, this
is 4096.
- show() methods should return the number of bytes printed into the
- buffer. This is the return value of scnprintf().
+ buffer.
-- show() must not use snprintf() when formatting the value to be
- returned to user space. If you can guarantee that an overflow
- will never happen you can use sprintf() otherwise you must use
- scnprintf().
+- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
+ the value to be returned to user space.
- store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
@@ -256,7 +253,7 @@ Other notes:
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
- referencing counting its embedded object. However, the physical
+ reference counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
@@ -266,7 +263,7 @@ A very simple (and naive) implementation of a device attribute is::
static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf)
{
- return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+ return sysfs_emit(buf, "%s\n", dev->name);
}
static ssize_t store_name(struct device *dev, struct device_attribute *attr,
@@ -298,8 +295,12 @@ The top level sysfs directory looks like::
dev/
devices/
firmware/
- net/
fs/
+ hypervisor/
+ kernel/
+ module/
+ net/
+ power/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
@@ -320,15 +321,18 @@ span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
-below fs/ (see ./fuse.txt for an example).
+below fs/ (see ./fuse.rst for an example).
+
+module/ contains parameter values and state information for all
+loaded modules, both built-in and loadable.
-dev/ contains two directories char/ and block/. Inside these two
+dev/ contains two directories: char/ and block/. Inside these two
directories there are symlinks named <major>:<minor>. These symlinks
point to the sysfs directory for the given device. /sys/dev provides a
quick way to lookup the sysfs interface for a device from the result of
a stat(2) operation.
-More information can driver-model specific features can be found in
+More information on driver-model specific features can be found in
Documentation/driver-api/driver-model/.
@@ -338,7 +342,7 @@ TODO: Finish this section.
Current Interfaces
~~~~~~~~~~~~~~~~~~
-The following interface layers currently exist in sysfs:
+The following interface layers currently exist in sysfs.
devices (include/linux/device.h)
@@ -369,8 +373,8 @@ Structure::
struct bus_attribute {
struct attribute attr;
- ssize_t (*show)(struct bus_type *, char * buf);
- ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
+ ssize_t (*show)(const struct bus_type *, char * buf);
+ ssize_t (*store)(const struct bus_type *, const char * buf, size_t count);
};
Declaring::
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 4e95929301a5..56a26c843dbe 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -4,7 +4,7 @@
Tmpfs
=====
-Tmpfs is a file system which keeps all files in virtual memory.
+Tmpfs is a file system which keeps all of its files in virtual memory.
Everything in tmpfs is temporary in the sense that no files will be
@@ -13,17 +13,29 @@ everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space. It has maximum size limits which can
-be adjusted on the fly via 'mount -o remount ...'
-
-If you compare it to ramfs (which was the template to create tmpfs)
-you gain swapping and limit checking. Another similar thing is the RAM
-disk (/dev/ram*), which simulates a fixed size hard disk in physical
-RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
-
-Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+unneeded pages out to swap space, if swap is enabled for the tmpfs
+mount. tmpfs also supports THP.
+
+tmpfs extends ramfs with a few userspace-configurable options listed and
+explained further below, some of which can be reconfigured dynamically on the
+fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
+filesystem can be resized but it cannot be resized to a size below its current
+usage. tmpfs also supports POSIX ACLs, and extended attributes for the
+trusted.*, security.* and user.* namespaces. ramfs does not use swap and you
+cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs
+filesystem is how much memory you have available, so care must be taken not
+to run out of memory if it is used.
+
+An alternative to tmpfs and ramfs is to use brd to create RAM disks
+(/dev/ram*), which allows you to simulate a block device in physical RAM.
+To write data you would then need to create a regular filesystem on top of
+this ramdisk. As with ramfs, brd ramdisks cannot swap, and they are
+configured in size at initialization, so you cannot dynamically resize them.
+Unlike brd ramdisks, tmpfs is its own filesystem; it does not rely on the
+block layer at all.
+
+Since tmpfs lives completely in the page cache and optionally on swap,
+all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
free(1). Notice that these counters also include shared memory
(shmem, see ipcs(1)). The most reliable way to get the count is
using df(1) and du(1).
@@ -35,7 +47,7 @@ tmpfs has the following uses:
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
- set, the user visible part of tmpfs is not build. But the internal
+ set, the user visible part of tmpfs is not built. But the internal
mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
@@ -50,7 +62,7 @@ tmpfs has the following uses:
This mount is _not_ needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
- shared memory)
+ shared memory.)
3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
@@ -83,8 +95,67 @@ If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
if nr_inodes=0, inodes will not be limited. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
-that instance in a system with many cpus making intensive use of it.
-
+that instance in a system with many CPUs making intensive use of it.
+
+If nr_inodes is not 0, that limited space for inodes is also used up by
+extended attributes: "df -i"'s IUsed and IUse% increase, IFree decreases.
+
+tmpfs blocks may be swapped out, when there is a shortage of memory.
+tmpfs has a mount option to disable its use of swap:
+
+====== ===========================================================
+noswap Disables swap. Remounts must respect the original settings.
+ By default swap is enabled.
+====== ===========================================================
+
+tmpfs also supports Transparent Huge Pages, which require a kernel
+configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge page support on
+your system (has_transparent_hugepage(), which is architecture specific).
+The mount options for this are:
+
+================ ==============================================================
+huge=never Do not allocate huge pages. This is the default.
+huge=always Attempt to allocate huge page every time a new page is needed.
+huge=within_size Only allocate huge page if it will be fully within i_size.
+ Also respect madvise(2) hints.
+huge=advise Only allocate huge page if requested with madvise(2).
+================ ==============================================================
+
+See also Documentation/admin-guide/mm/transhuge.rst, which describes the
+sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled, which can
+be used to deny huge pages on all tmpfs mounts in an emergency, or to
+force huge pages on all tmpfs mounts for testing.
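+
+For example, 'mount -t tmpfs -o huge=within_size,noswap tmpfs /mytmpfs'
+gives an instance that allocates huge pages only where they fit within
+i_size and never uses swap.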
+
+tmpfs also supports quota with the following mount options:
+
+======================== =================================================
+quota User and group quota accounting and enforcement
+                         is enabled on the mount. tmpfs uses hidden
+                         system quota files that are initialized on mount.
+usrquota User quota accounting and enforcement is enabled
+ on the mount.
+grpquota Group quota accounting and enforcement is enabled
+ on the mount.
+usrquota_block_hardlimit Set global user quota block hard limit.
+usrquota_inode_hardlimit Set global user quota inode hard limit.
+grpquota_block_hardlimit Set global group quota block hard limit.
+grpquota_inode_hardlimit Set global group quota inode hard limit.
+======================== =================================================
+
+None of the quota-related mount options can be set or changed on remount.
+
+Quota limit parameters accept a suffix k, m or g for kilo, mega and giga
+and can't be changed on remount. Default global quota limits take effect
+for any and all users/groups/projects except root the first time the
+quota entry for a user/group/project id is accessed - typically the
+first time an inode owned by that id is created after the mount. In
+other words, instead of the limits being initialized to zero, they are
+initialized with the particular value provided with these mount options.
+The limits can be changed for any user/group id at any time, as they
+normally can be.
+
+Note that tmpfs quotas do not support user namespaces, so no uid/gid
+translation is done if quotas are enabled inside user namespaces.
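+
+For example, 'mount -t tmpfs -o usrquota,usrquota_block_hardlimit=1g tmpfs
+/mytmpfs' enables user quota enforcement with a default block hard limit
+of 1GiB for each user.
+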
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
@@ -150,6 +221,22 @@ These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+======= ========================
+inode64 Use 64-bit inode numbers
+inode32 Use 32-bit inode numbers
+======= ========================
+
+On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
+On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the
+possibility of multiple files with the same inode number on a single device;
+but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
+if a long-lived tmpfs is accessed by 32-bit applications so ancient that
+opening a file larger than 2GiB fails with EINVAL.
+
+
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
@@ -161,3 +248,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
Hugh Dickins, 4 June 2007
:Updated:
KOSAKI Motohiro, 16 Mar 2010
+:Updated:
+ Chris Down, 13 July 2020
diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst
index 16efd729bf7c..3d85ee88719a 100644
--- a/Documentation/filesystems/ubifs-authentication.rst
+++ b/Documentation/filesystems/ubifs-authentication.rst
@@ -1,11 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
-:orphan:
-
.. UBIFS Authentication
.. sigma star gmbh
.. 2018
+============================
+UBIFS Authentication Support
+============================
+
Introduction
============
@@ -128,7 +130,7 @@ marked as dirty are written to the flash to update the persisted index.
Journal
~~~~~~~
-To avoid wearing out the flash, the index is only persisted (*commited*) when
+To avoid wearing out the flash, the index is only persisted (*committed*) when
certain conditions are met (eg. ``fsync(2)``). The journal is used to record
any changes (in form of inode nodes, data nodes etc.) between commits
of the index. During mount, the journal is read from the flash and replayed
@@ -433,9 +435,9 @@ will then have to be provided beforehand in the normal way.
References
==========
-[CRYPTSETUP2] http://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
+[CRYPTSETUP2] https://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
-[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
+[DMC-CBC-ATTACK] https://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
diff --git a/Documentation/filesystems/ubifs.rst b/Documentation/filesystems/ubifs.rst
index e6ee99762534..ced2f7679ddb 100644
--- a/Documentation/filesystems/ubifs.rst
+++ b/Documentation/filesystems/ubifs.rst
@@ -59,7 +59,7 @@ differences.
* JFFS2 is a write-through file-system, while UBIFS supports write-back,
which makes UBIFS much faster on writes.
-Similarly to JFFS2, UBIFS supports on-the-flight compression which makes
+Similarly to JFFS2, UBIFS supports on-the-fly compression which makes
it possible to fit quite a lot of data to the flash.
Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
diff --git a/Documentation/filesystems/udf.rst b/Documentation/filesystems/udf.rst
index d9badbf285b2..f9489ddbb767 100644
--- a/Documentation/filesystems/udf.rst
+++ b/Documentation/filesystems/udf.rst
@@ -72,4 +72,4 @@ For the latest version and toolset see:
Documentation on UDF and ECMA 167 is available FREE from:
- http://www.osta.org/
- - http://www.ecma-international.org/
+ - https://www.ecma-international.org/
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
index e85d74e91295..b289c4449cd0 100644
--- a/Documentation/filesystems/vfat.rst
+++ b/Documentation/filesystems/vfat.rst
@@ -50,7 +50,7 @@ VFAT MOUNT OPTIONS
Normally utime(2) checks current process is owner of
the file, or it has CAP_FOWNER capability. But FAT
filesystem doesn't have uid/gid on disk, so normal
- check is too unflexible. With this option you can
+ check is too inflexible. With this option you can
relax it.
**codepage=###**
@@ -189,7 +189,7 @@ VFAT MOUNT OPTIONS
**discard**
If set, issues discard/TRIM commands to the block
device when blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisoned LUNs.
+ and sparse/thinly-provisioned LUNs.
**nfs=stale_rw|nostale_ro**
Enable this only if you want to export the FAT filesystem
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..eebcc0f9e2bc 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -107,22 +107,32 @@ file /proc/filesystems.
struct file_system_type
-----------------------
-This describes the filesystem. As of kernel 2.6.39, the following
+This describes the filesystem. The following
members are defined:
.. code-block:: c
- struct file_system_operations {
+ struct file_system_type {
const char *name;
int fs_flags;
+ int (*init_fs_context)(struct fs_context *);
+ const struct fs_parameter_spec *parameters;
struct dentry *(*mount) (struct file_system_type *, int,
- const char *, void *);
+ const char *, void *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
- struct list_head fs_supers;
+ struct hlist_head fs_supers;
+
struct lock_class_key s_lock_key;
struct lock_class_key s_umount_key;
+ struct lock_class_key s_vfs_rename_key;
+ struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];
+
+ struct lock_class_key i_lock_key;
+ struct lock_class_key i_mutex_key;
+ struct lock_class_key invalidate_lock_key;
+ struct lock_class_key i_mutex_dir_key;
};
``name``
@@ -132,6 +142,15 @@ members are defined:
``fs_flags``
various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
+``init_fs_context``
+ Initializes 'struct fs_context' ->ops and ->fs_private fields with
+ filesystem-specific data.
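+
+	A minimal sketch, for a hypothetical filesystem whose
+	myfs_context_ops table is defined elsewhere::
+
+		static int myfs_init_fs_context(struct fs_context *fc)
+		{
+			/* wire up filesystem-specific fs_context operations */
+			fc->ops = &myfs_context_ops;
+			return 0;
+		}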
+
+``parameters``
+	Pointer to an array of filesystem parameter descriptors
+	('struct fs_parameter_spec').
+ More info in Documentation/filesystems/mount_api.rst.
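+
+	For example, a hypothetical filesystem with a single "mode"
+	option might declare (Opt_mode being a filesystem-private
+	enum)::
+
+		static const struct fs_parameter_spec myfs_parameters[] = {
+			fsparam_u32oct("mode", Opt_mode),
+			{}
+		};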
+
``mount``
the method to call when a new instance of this filesystem should
be mounted
@@ -148,7 +167,11 @@ members are defined:
``next``
for internal VFS use: you should initialize this to NULL
- s_lock_key, s_umount_key: lockdep-specific
+``fs_supers``
+ for internal VFS use: hlist of filesystem instances (superblocks)
+
+ s_lock_key, s_umount_key, s_vfs_rename_key, s_writers_key,
+ i_lock_key, i_mutex_key, invalidate_lock_key, i_mutex_dir_key: lockdep-specific
The mount() method has the following arguments:
@@ -222,33 +245,44 @@ struct super_operations
-----------------------
This describes how the VFS can manipulate the superblock of your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. The following members are defined:
.. code-block:: c
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
+ void (*free_inode)(struct inode *);
void (*dirty_inode) (struct inode *, int flags);
- int (*write_inode) (struct inode *, int);
- void (*drop_inode) (struct inode *);
- void (*delete_inode) (struct inode *);
+ int (*write_inode) (struct inode *, struct writeback_control *wbc);
+ int (*drop_inode) (struct inode *);
+ void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
+ int (*freeze_super) (struct super_block *sb,
+ enum freeze_holder who);
int (*freeze_fs) (struct super_block *);
+ int (*thaw_super) (struct super_block *sb,
+				enum freeze_holder who);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
- void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
+ int (*show_devname)(struct seq_file *, struct dentry *);
+ int (*show_path)(struct seq_file *, struct dentry *);
+ int (*show_stats)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
- int (*nr_cached_objects)(struct super_block *);
- void (*free_cached_objects)(struct super_block *, int);
+ struct dquot **(*get_dquots)(struct inode *);
+
+ long (*nr_cached_objects)(struct super_block *,
+ struct shrink_control *);
+ long (*free_cached_objects)(struct super_block *,
+ struct shrink_control *);
};
All methods are called without any locks being held, unless otherwise
@@ -269,8 +303,19 @@ or bottom half).
->alloc_inode was defined and simply undoes anything done by
->alloc_inode.
+``free_inode``
+	this method is called from an RCU callback. If you use
+	call_rcu() in ->destroy_inode to free 'struct inode' memory,
+	then it's better to release the memory in this method.
+
``dirty_inode``
- this method is called by the VFS to mark an inode dirty.
+ this method is called by the VFS when an inode is marked dirty.
+ This is specifically for the inode itself being marked dirty,
+ not its data. If the update needs to be persisted by fdatasync(),
+ then I_DIRTY_DATASYNC will be set in the flags argument.
+ I_DIRTY_TIME will be set in the flags in case lazytime is enabled
+ and struct inode has times updated since the last ->dirty_inode
+ call.
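+
+	A sketch of how the flags might be consumed; myfs_journal_inode
+	is hypothetical::
+
+		static void myfs_dirty_inode(struct inode *inode, int flags)
+		{
+			/* metadata needed by fdatasync() must be persisted */
+			if (flags & I_DIRTY_DATASYNC)
+				myfs_journal_inode(inode);
+		}
+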
``write_inode``
this method is called when the VFS needs to write an inode to
@@ -290,8 +335,12 @@ or bottom half).
practice of using "force_delete" in the put_inode() case, but
does not have the races that the "force_delete()" approach had.
-``delete_inode``
- called when the VFS wants to delete an inode
+``evict_inode``
+ called when the VFS wants to evict an inode. Caller does
+ *not* evict the pagecache or inode-associated metadata buffers;
+ the method has to use truncate_inode_pages_final() to get rid
+ of those. Caller makes sure async writeback cannot be running for
+ the inode while (or after) ->evict_inode() is called. Optional.
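+
+	A minimal sketch for a filesystem with no private metadata
+	buffers to drop::
+
+		static void myfs_evict_inode(struct inode *inode)
+		{
+			/* the caller does not evict the pagecache for us */
+			truncate_inode_pages_final(&inode->i_data);
+			clear_inode(inode);
+		}
+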
``put_super``
called when the VFS wishes to free the superblock
@@ -302,14 +351,25 @@ or bottom half).
superblock. The second parameter indicates whether the method
should wait until the write out has been completed. Optional.
+``freeze_super``
+ Called instead of ->freeze_fs callback if provided.
+ Main difference is that ->freeze_super is called without taking
+ down_write(&sb->s_umount). If filesystem implements it and wants
+ ->freeze_fs to be called too, then it has to call ->freeze_fs
+ explicitly from this callback. Optional.
+
``freeze_fs``
called when VFS is locking a filesystem and forcing it into a
consistent state. This method is currently used by the Logical
- Volume Manager (LVM).
+ Volume Manager (LVM) and ioctl(FIFREEZE). Optional.
+
+``thaw_super``
+ called when VFS is unlocking a filesystem and making it writable
+ again after ->freeze_super. Optional.
``unfreeze_fs``
called when VFS is unlocking a filesystem and making it writable
- again.
+ again after ->freeze_fs. Optional.
``statfs``
called when the VFS needs to get filesystem statistics.
@@ -318,22 +378,37 @@ or bottom half).
called when the filesystem is remounted. This is called with
the kernel lock held
-``clear_inode``
- called then the VFS clears the inode. Optional
-
``umount_begin``
called when the VFS is unmounting a filesystem.
``show_options``
- called by the VFS to show mount options for /proc/<pid>/mounts.
+ called by the VFS to show mount options for /proc/<pid>/mounts
+ and /proc/<pid>/mountinfo.
(see "Mount Options" section)
+``show_devname``
+ Optional. Called by the VFS to show device name for
+ /proc/<pid>/{mounts,mountinfo,mountstats}. If not provided then
+ '(struct mount).mnt_devname' will be used.
+
+``show_path``
+ Optional. Called by the VFS (for /proc/<pid>/mountinfo) to show
+ the mount root dentry path relative to the filesystem root.
+
+``show_stats``
+ Optional. Called by the VFS (for /proc/<pid>/mountstats) to show
+ filesystem-specific mount statistics.
+
``quota_read``
called by the VFS to read from filesystem quota file.
``quota_write``
called by the VFS to write to filesystem quota file.
+``get_dquots``
+ called by quota to get 'struct dquot' array for a particular inode.
+ Optional.
+
``nr_cached_objects``
called by the sb cache shrinking function for the filesystem to
return the number of freeable cached objects it contains.
@@ -362,7 +437,7 @@ field. This is a pointer to a "struct inode_operations" which describes
the methods that can be performed on individual inodes.
-struct xattr_handlers
+struct xattr_handler
---------------------
On filesystems that support extended attributes (xattrs), the s_xattr
@@ -392,7 +467,7 @@ Extended attributes are name:value pairs.
``set``
Called by the VFS to set the value of a particular extended
attribute. When the new value is NULL, called to remove a
- particular extended attribute. This method is called by the the
+ particular extended attribute. This method is called by the
setxattr(2) and removexattr(2) system calls.
When none of the xattr handlers of a filesystem match the specified
@@ -415,28 +490,34 @@ As of kernel 2.6.22, the following members are defined:
.. code-block:: c
struct inode_operations {
- int (*create) (struct inode *,struct dentry *, umode_t, bool);
+ int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
- int (*symlink) (struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct inode *,struct dentry *,umode_t);
+ int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
- int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
- int (*rename) (struct inode *, struct dentry *,
+ int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *,
struct delayed_call *);
- int (*permission) (struct inode *, int);
- int (*get_acl)(struct inode *, int);
- int (*setattr) (struct dentry *, struct iattr *);
- int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
+ int (*permission) (struct mnt_idmap *, struct inode *, int);
+ struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
+ int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
+ int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *, struct file *,
unsigned open_flag, umode_t create_mode);
- int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+ int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
+ struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
+ int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
+ int (*fileattr_set)(struct mnt_idmap *idmap,
+ struct dentry *dentry, struct fileattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
Again, all methods are called without any locks being held, unless
@@ -582,8 +663,25 @@ otherwise noted.
``tmpfile``
called in the end of O_TMPFILE open(). Optional, equivalent to
atomically creating, opening and unlinking a file in given
- directory.
-
+ directory. On success needs to return with the file already
+ open; this can be done by calling finish_open_simple() right at
+ the end.
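+
+	A sketch; myfs_new_inode is a hypothetical inode allocator::
+
+		static int myfs_tmpfile(struct mnt_idmap *idmap,
+					struct inode *dir, struct file *file,
+					umode_t mode)
+		{
+			struct inode *inode = myfs_new_inode(dir, mode);
+
+			if (IS_ERR(inode))
+				return PTR_ERR(inode);
+			d_tmpfile(file, inode);
+			/* return with the new file already open */
+			return finish_open_simple(file, 0);
+		}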
+
+``fileattr_get``
+ called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to
+ retrieve miscellaneous file flags and attributes. Also called
+ before the relevant SET operation to check what is being changed
+ (in this case with i_rwsem locked exclusive). If unset, then
+ fall back to f_op->ioctl().
+
+``fileattr_set``
+ called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
+ change miscellaneous file flags and attributes. Callers hold
+ i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
+
+``get_offset_ctx``
+ called to get the offset context for a directory inode. A
+ filesystem must define this operation to use
+ simple_offset_dir_operations.
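+
+	A sketch, assuming a hypothetical MYFS_I() container with an
+	embedded offset_ctx::
+
+		static struct offset_ctx *myfs_get_offset_ctx(struct inode *inode)
+		{
+			return &MYFS_I(inode)->dir_offsets;
+		}
+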
The Address Space Object
========================
@@ -601,9 +699,9 @@ Writeback.
The first can be used independently to the others. The VM can try to
either write dirty pages in order to clean them, or release clean pages
in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->releasepage on clean pages with PagePrivate set.
-Clean pages without PagePrivate and with no external references will be
-released without notice being given to the address_space.
+on dirty pages, and ->release_folio on clean folios with the private
+flag set. Clean pages without PagePrivate and with no external references
+will be released without notice being given to the address_space.
To achieve this functionality, pages need to be placed on an LRU with
lru_cache_add and mark_page_active needs to be called whenever the page
@@ -637,9 +735,9 @@ by memory-mapping the page. Data is written into the address space by
the application, and then written-back to storage typically in whole
pages, however the address_space has finer control of write sizes.
-The read process essentially only requires 'readpage'. The write
+The read process essentially only requires 'read_folio'. The write
process is more complicated and uses write_begin/write_end or
-set_page_dirty to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and writepage and
writepages to write back data to storage.
Adding and removing pages to/from an address_space is protected by the
@@ -652,7 +750,7 @@ at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the the writepage and writepages operations some
+operations. This gives the writepage and writepages operations some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
return information back to the caller about the result of a writepage or
@@ -669,7 +767,7 @@ is an error during writeback, they expect that error to be reported when
a file sync request is made. After an error has been reported on one
request, subsequent requests on the same file descriptor should return
0, unless further writeback errors have occurred since the previous file
-syncronization.
+synchronization.
Ideally, the kernel would report errors only on file descriptions on
which writes were done that subsequently failed to be written back. The
@@ -703,36 +801,32 @@ cache in your filesystem. The following members are defined:
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
- int (*readpage)(struct file *, struct page *);
+ int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
- int (*set_page_dirty)(struct page *page);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
+ bool (*dirty_folio)(struct address_space *, struct folio *);
+ void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
+ loff_t pos, unsigned len,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
- void (*invalidatepage) (struct page *, unsigned int, unsigned int);
- int (*releasepage) (struct page *, int);
- void (*freepage)(struct page *);
+ void (*invalidate_folio) (struct folio *, size_t start, size_t len);
+ bool (*release_folio)(struct folio *, gfp_t);
+ void (*free_folio)(struct folio *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- /* isolate a page for migration */
- bool (*isolate_page) (struct page *, isolate_mode_t);
- /* migrate the contents of a page to the specified target */
- int (*migratepage) (struct page *, struct page *);
- /* put migration-failed page back to right list */
- void (*putback_page) (struct page *);
- int (*launder_page) (struct page *);
-
- int (*is_partially_uptodate) (struct page *, unsigned long,
- unsigned long);
- void (*is_dirty_writeback) (struct page *, bool *, bool *);
- int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
+	int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
+ int (*launder_folio) (struct folio *);
+
+ bool (*is_partially_uptodate) (struct folio *, size_t from,
+ size_t count);
+ void (*is_dirty_writeback)(struct folio *, bool *, bool *);
+	int (*error_remove_folio)(struct address_space *mapping, struct folio *);
+	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
``writepage``
@@ -754,39 +848,73 @@ cache in your filesystem. The following members are defined:
See the file "Locking" for more details.
-``readpage``
- called by the VM to read a page from backing store. The page
- will be Locked when readpage is called, and should be unlocked
- and marked uptodate once the read completes. If ->readpage
- discovers that it needs to unlock the page for some reason, it
- can do so, and then return AOP_TRUNCATED_PAGE. In this case,
- the page will be relocated, relocked and if that all succeeds,
- ->readpage will be called again.
+``read_folio``
+ Called by the page cache to read a folio from the backing store.
+ The 'file' argument supplies authentication information to network
+ filesystems, and is generally not used by block based filesystems.
+ It may be NULL if the caller does not have an open file (eg if
+ the kernel is performing a read for itself rather than on behalf
+ of a userspace process with an open file).
+
+ If the mapping does not support large folios, the folio will
+ contain a single page. The folio will be locked when read_folio
+ is called. If the read completes successfully, the folio should
+ be marked uptodate. The filesystem should unlock the folio
+ once the read has completed, whether it was successful or not.
+ The filesystem does not need to modify the refcount on the folio;
+ the page cache holds a reference count and that will not be
+ released until the folio is unlocked.
+
+ Filesystems may implement ->read_folio() synchronously.
+ In normal operation, folios are read through the ->readahead()
+ method. Only if this fails, or if the caller needs to wait for
+ the read to complete will the page cache call ->read_folio().
+ Filesystems should not attempt to perform their own readahead
+ in the ->read_folio() operation.
+
+ If the filesystem cannot perform the read at this time, it can
+ unlock the folio, do whatever action it needs to ensure that the
+ read will succeed in the future and return AOP_TRUNCATED_PAGE.
+ In this case, the caller should look up the folio, lock it,
+ and call ->read_folio again.
+
+ Callers may invoke the ->read_folio() method directly, but using
+ read_mapping_folio() will take care of locking, waiting for the
+ read to complete and handle cases such as AOP_TRUNCATED_PAGE.
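+
+	A minimal sketch of this contract, for a hypothetical
+	filesystem whose file contents read back as zeroes::
+
+		static int myfs_read_folio(struct file *file, struct folio *folio)
+		{
+			/* fill the folio, then publish it as uptodate */
+			folio_zero_range(folio, 0, folio_size(folio));
+			folio_mark_uptodate(folio);
+			/* unlock whether or not the read succeeded */
+			folio_unlock(folio);
+			return 0;
+		}
+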
``writepages``
called by the VM to write out pages associated with the
- address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then
+ address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
the writeback_control will specify a range of pages that must be
- written out. If it is WBC_SYNC_NONE, then a nr_to_write is
+ written out. If it is WB_SYNC_NONE, then a nr_to_write is
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
DIRTY and will pass them to ->writepage.
-``set_page_dirty``
- called by the VM to set a page dirty. This is particularly
- needed if an address space attaches private data to a page, and
- that data needs to be updated when a page is dirtied. This is
+``dirty_folio``
+ called by the VM to mark a folio as dirty. This is particularly
+ needed if an address space attaches private data to a folio, and
+ that data needs to be updated when a folio is dirtied. This is
called, for example, when a memory mapped page gets modified.
- If defined, it should set the PageDirty flag, and the
- PAGECACHE_TAG_DIRTY tag in the radix tree.
-
-``readpages``
- called by the VM to read pages associated with the address_space
- object. This is essentially just a vector version of readpage.
- Instead of just one page, several pages are requested.
- readpages is only used for read-ahead, so read errors are
- ignored. If anything goes wrong, feel free to give up.
+ If defined, it should set the folio dirty flag, and the
+ PAGECACHE_TAG_DIRTY search mark in i_pages.
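+
+	Filesystems that attach no private data to their folios can
+	often point this at the pagecache helper::
+
+		.dirty_folio	= filemap_dirty_folio,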
+
+``readahead``
+ Called by the VM to read pages associated with the address_space
+ object. The pages are consecutive in the page cache and are
+ locked. The implementation should decrement the page refcount
+ after starting I/O on each page. Usually the page will be
+ unlocked by the I/O completion handler. The set of pages are
+ divided into some sync pages followed by some async pages,
+ rac->ra->async_size gives the number of async pages. The
+ filesystem should attempt to read all sync pages but may decide
+ to stop once it reaches the async pages. If it does decide to
+ stop attempting I/O, it can simply return. The caller will
+ remove the remaining pages from the address space, unlock them
+ and decrement the page refcount. Set PageUptodate if the I/O
+ completes successfully. Setting PageError on any page will be
+ ignored; simply unlock the page if an I/O error occurs.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -805,9 +933,6 @@ cache in your filesystem. The following members are defined:
passed to write_begin is greater than the number of bytes copied
into the page).
- flags is a field for AOP_FLAG_xxx flags, described in
- include/linux/fs.h.
-
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -834,42 +959,41 @@ cache in your filesystem. The following members are defined:
to find out where the blocks in the file are and uses those
addresses directly.
-``invalidatepage``
- If a page has PagePrivate set, then invalidatepage will be
- called when part or all of the page is to be removed from the
+``invalidate_folio``
+ If a folio has private data, then invalidate_folio will be
+ called when part or all of the folio is to be removed from the
address space. This generally corresponds to either a
truncation, punch hole or a complete invalidation of the address
space (in the latter case 'offset' will always be 0 and 'length'
- will be PAGE_SIZE). Any private data associated with the page
+ will be folio_size()). Any private data associated with the folio
should be updated to reflect this truncation. If offset is 0
- and length is PAGE_SIZE, then the private data should be
- released, because the page must be able to be completely
- discarded. This may be done by calling the ->releasepage
+ and length is folio_size(), then the private data should be
+ released, because the folio must be able to be completely
+ discarded. This may be done by calling the ->release_folio
function, but in this case the release MUST succeed.
-``releasepage``
- releasepage is called on PagePrivate pages to indicate that the
- page should be freed if possible. ->releasepage should remove
- any private data from the page and clear the PagePrivate flag.
- If releasepage() fails for some reason, it must indicate failure
- with a 0 return value. releasepage() is used in two distinct
- though related cases. The first is when the VM finds a clean
- page with no active users and wants to make it a free page. If
- ->releasepage succeeds, the page will be removed from the
- address_space and become free.
+``release_folio``
+ release_folio is called on folios with private data to tell the
+ filesystem that the folio is about to be freed. ->release_folio
+ should remove any private data from the folio and clear the
+ private flag. If release_folio() fails, it should return false.
+ release_folio() is used in two distinct though related cases.
+ The first is when the VM wants to free a clean folio with no
+ active users. If ->release_folio succeeds, the folio will be
+ removed from the address_space and be freed.
The second case is when a request has been made to invalidate
- some or all pages in an address_space. This can happen through
- the fadvise(POSIX_FADV_DONTNEED) system call or by the
- filesystem explicitly requesting it as nfs and 9fs do (when they
+ some or all folios in an address_space. This can happen
+ through the fadvise(POSIX_FADV_DONTNEED) system call or by the
+ filesystem explicitly requesting it as nfs and 9p do (when they
believe the cache may be out of date with storage) by calling
invalidate_inode_pages2(). If the filesystem makes such a call,
- and needs to be certain that all pages are invalidated, then its
- releasepage will need to ensure this. Possibly it can clear the
- PageUptodate bit if it cannot free private data yet.
+ and needs to be certain that all folios are invalidated, then
+ its release_folio will need to ensure this. Possibly it can
+ clear the uptodate flag if it cannot free private data yet.
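+
+	A minimal sketch (``myfs`` names and the busy check are
+	placeholders; folio_detach_private() is the real helper that
+	clears the private flag)::
+
+	    static bool myfs_release_folio(struct folio *folio, gfp_t gfp)
+	    {
+	            void *data;
+
+	            if (myfs_folio_busy(folio))     /* hypothetical check */
+	                    return false;
+	            data = folio_detach_private(folio);
+	            myfs_free_private(data);        /* hypothetical free */
+	            return true;
+	    }
+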
-``freepage``
- freepage is called once the page is no longer visible in the
+``free_folio``
+ free_folio is called once the folio is no longer visible in the
page cache in order to allow the cleanup of any private data.
Since it may be called by the memory reclaimer, it should not
assume that the original address_space mapping still exists, and
@@ -881,59 +1005,57 @@ cache in your filesystem. The following members are defined:
data directly between the storage and the application's address
space.
-``isolate_page``
- Called by the VM when isolating a movable non-lru page. If page
- is successfully isolated, VM marks the page as PG_isolated via
- __SetPageIsolated.
-
-``migrate_page``
+``migrate_folio``
This is used to compact the physical memory usage. If the VM
- wants to relocate a page (maybe off a memory card that is
- signalling imminent failure) it will pass a new page and an old
- page to this function. migrate_page should transfer any private
- data across and update any references that it has to the page.
-
-``putback_page``
- Called by the VM when isolated page's migration fails.
-
-``launder_page``
- Called before freeing a page - it writes back the dirty page.
- To prevent redirtying the page, it is kept locked during the
+ wants to relocate a folio (maybe from a memory device that is
+ signalling imminent failure) it will pass a new folio and an old
+ folio to this function. migrate_folio should transfer any private
+ data across and update any references that it has to the folio.
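+
+	A filesystem that keeps no state beyond what the page cache
+	already tracks can often point this at the generic helper (a
+	sketch, not a requirement)::
+
+	    .migrate_folio = filemap_migrate_folio,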
+
+``launder_folio``
+ Called before freeing a folio - it writes back the dirty folio.
+ To prevent redirtying the folio, it is kept locked during the
whole operation.
``is_partially_uptodate``
Called by the VM when reading a file through the pagecache when
- the underlying blocksize != pagesize. If the required block is
- up to date then the read can complete without needing the IO to
- bring the whole page up to date.
+ the underlying blocksize is smaller than the size of the folio.
+ If the required block is up to date then the read can complete
+	without needing I/O to bring the whole folio up to date.
``is_dirty_writeback``
- Called by the VM when attempting to reclaim a page. The VM uses
+ Called by the VM when attempting to reclaim a folio. The VM uses
dirty and writeback information to determine if it needs to
stall to allow flushers a chance to complete some IO.
- Ordinarily it can use PageDirty and PageWriteback but some
- filesystems have more complex state (unstable pages in NFS
+ Ordinarily it can use folio_test_dirty and folio_test_writeback but
+ some filesystems have more complex state (unstable folios in NFS
prevent reclaim) or do not set those flags due to locking
problems. This callback allows a filesystem to indicate to the
- VM if a page should be treated as dirty or writeback for the
+ VM if a folio should be treated as dirty or writeback for the
purposes of stalling.
-``error_remove_page``
- normally set to generic_error_remove_page if truncation is ok
+``error_remove_folio``
+ normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased.
``swap_activate``
- Called when swapon is used on a file to allocate space if
- necessary and pin the block lookup information in memory. A
- return value of zero indicates success, in which case this file
- can be used to back swapspace.
+ Called to prepare the given file for swap. It should perform
+ any validation and preparation necessary to ensure that writes
+ can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and
+ return the number of extents added. If IO should be submitted
+ through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+ be submitted directly to the block device ``sis->bdev``.
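+
+	A rough sketch, modelled on how a network filesystem might do
+	this (validation is elided and ``myfs`` is a placeholder)::
+
+	    static int myfs_swap_activate(struct swap_info_struct *sis,
+	                                  struct file *file, sector_t *span)
+	    {
+	            /* validate and pin down the file's mapping here */
+	            sis->flags |= SWP_FS_OPS;  /* IO goes via ->swap_rw() */
+	            *span = sis->pages;
+	            return add_swap_extent(sis, 0, sis->max, 0);
+	    }
+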
``swap_deactivate``
Called during swapoff on files where swap_activate was
successful.
+``swap_rw``
+ Called to read or write swap pages when SWP_FS_OPS is set.
The File Object
===============
@@ -958,7 +1080,6 @@ This describes how the VFS can manipulate an open file. As of kernel
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iopoll)(struct kiocb *kiocb, bool spin);
- int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -970,7 +1091,6 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
@@ -1011,12 +1131,8 @@ otherwise noted.
``iopoll``
called when aio wants to poll for completions on HIPRI iocbs
-``iterate``
- called when the VFS needs to read the directory contents
-
``iterate_shared``
- called when the VFS needs to read the directory contents when
- filesystem supports concurrent dir iterators
+ called when the VFS needs to read the directory contents
``poll``
called by the VFS when a process wants to check if there is
@@ -1101,7 +1217,7 @@ otherwise noted.
before any bytes were remapped. The remap_flags parameter
accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the
implementation must only remap if the requested file ranges have
- identical contents. If REMAP_CAN_SHORTEN is set, the caller is
+ identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason).
@@ -1173,7 +1289,7 @@ defined:
return
-ECHILD and it will be called again in ref-walk mode.
-``_weak_revalidate``
+``d_weak_revalidate``
called when the VFS needs to revalidate a "jumped" dentry. This
is called when a path-walk ends at dentry that was not acquired
by doing a lookup in the parent directory. This includes "/",
@@ -1416,13 +1532,13 @@ Resources
version.)
Creating Linux virtual filesystems. 2002
- <http://lwn.net/Articles/13325/>
+ <https://lwn.net/Articles/13325/>
The Linux Virtual File-system Layer by Neil Brown. 1999
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
A tour of the Linux VFS by Michael K. Johnson. 1996
- <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
+ <https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
A small trail through the Linux kernel by Andries Brouwer. 2001
- <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
+ <https://www.win.tue.nl/~aeb/linux/vfs/trail.html>
diff --git a/Documentation/filesystems/virtiofs.rst b/Documentation/filesystems/virtiofs.rst
index e06e4951cb39..fd4d2484e949 100644
--- a/Documentation/filesystems/virtiofs.rst
+++ b/Documentation/filesystems/virtiofs.rst
@@ -39,6 +39,20 @@ Mount file system with tag ``myfs`` on ``/mnt``:
Please see https://virtio-fs.gitlab.io/ for details on how to configure QEMU
and the virtiofsd daemon.
+Mount options
+-------------
+
+virtiofs supports general VFS mount options such as remount, ro, rw,
+context, etc. It also supports FUSE mount options.
+
+atime behavior
+^^^^^^^^^^^^^^
+
+Atime-related mount options such as noatime and strictatime are
+ignored. The atime behavior of virtiofs is that of the underlying
+filesystem of the directory that has been exported on the host.
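+
+For example, a read-only mount of the ``myfs`` tag::
+
+  guest# mount -t virtiofs myfs /mnt -o ro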
+
Internals
=========
Since the virtio-fs device uses the FUSE protocol for file system requests, the
diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
new file mode 100644
index 000000000000..ab66c57a5d18
--- /dev/null
+++ b/Documentation/filesystems/xfs/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+XFS Filesystem Documentation
+============================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ xfs-delayed-logging-design
+ xfs-maintainer-entry-profile
+ xfs-self-describing-metadata
+ xfs-online-fsck-design
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
index 9a6dd289b17b..6402ab8e370c 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.txt
+++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
@@ -1,31 +1,319 @@
-XFS Delayed Logging Design
---------------------------
-
-Introduction to Re-logging in XFS
----------------------------------
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on, so that readers may familiarize themselves with the
+general concepts of how transaction processing in XFS works.
+
+We begin with an overview of transactions in XFS, followed by a description of
+how transaction reservations are structured and accounted, and then move into
+how we guarantee forwards progress for long running transactions with finite
+initial reservation bounds. At this point we need to explain how relogging
+works. With the basic concepts covered, the design of the delayed logging
+mechanism is documented.
+
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. Long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so keeping
+the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document. It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together differ and depend on the object and/or modification being performed.
+The logging subsystem only cares that certain specific rules are followed to
+guarantee forwards progress and prevent deadlocks.
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transactions can take reservations that span commit
+boundaries, whilst "one shot" transactions are for a single atomic
+modification.
+
+The type and size of reservation must be matched to the modification taking
+place. This means that permanent transactions can be used for one-shot
+modifications, but one-shot reservations cannot be used for permanent
+transactions.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ <lock items>
+ <join item to transaction>
+ <do modification>
+ xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle. Once the transaction is committed, all
+resources joined to it are released, along with the remaining unused reservation
+space that was taken at the transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+ loop {
+        xfs_trans_ijoin(tp, ip, 0);
+ <do modification>
+        xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_roll(&tp);
+ }
+
+ xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+ ntp = xfs_trans_dup(tp);
+ xfs_trans_commit(tp);
+ xfs_trans_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions. Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stable storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or
+to ensure that a long running permanent transaction is fully committed once it
+is complete, we can explicitly tag a transaction as synchronous. This will
+trigger a "log force" to flush the outstanding committed transactions to
+stable storage in the journal and wait for that to complete.
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
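+
+In code, a minimal sketch of tagging a transaction synchronous before
+committing it looks like this::
+
+  xfs_trans_set_sync(tp);	/* commit will force and wait on the log */
+  xfs_trans_commit(tp);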
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again. Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on. Hence the amount of
+metadata that a "simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it. Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space. This ensures that the common modifications we make only need to
+reserve log space once.
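+
+As a worked example (the numbers are purely illustrative): with a unit
+reservation of 100kB and a reservation log count of 2, transaction allocation
+reserves 200kB up front. The initial transaction and the first roll each
+consume one unit, so the common two-transaction chain completes without ever
+blocking to re-reserve log space mid-chain.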
+
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log. A LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do realtively simple LSN based math to keep track of
+available space in the log.
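+
+For example, splitting a 64 bit LSN into its components looks roughly like
+this (a sketch in C terms; XFS wraps these operations in macros)::
+
+  uint32_t cycle  = (uint32_t)(lsn >> 32); /* times the log has wrapped */
+  uint32_t offset = (uint32_t)lsn;         /* basic blocks into the log */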
+
+Log space accounting is done via a pair of constructs called "grant heads". The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head. This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue, and as the log tail
+moves forward, it will wake them in order once sufficient space is available.
+This FIFO mechanism ensures no transaction is starved of resources when log
+space shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences become apparent when a permanent transaction is rolled and
+the internal "log count" reaches zero and the initial set of unit reservations
+have been exhausted. At this point, we still require a log space reservation
+to continue the next transaction in the sequence, but we have none remaining.
+We cannot sleep during the transaction commit process waiting for new log
+space to become available, as we may end up at the end of the FIFO queue while
+the items we have locked pin the tail of the log, preventing the log from ever
+gaining enough free space to fulfill all of the pending reservations and wake
+the sleeping transaction commit.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. Once the commit completes, we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+ Code using permanent reservations must always log the items they hold
+ locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+attached to the transaction chain being rolled are always relocated to the
+physical head of the log and so do not pin the tail of the log. If a locked item
+pins the tail of the log when we sleep on the write reservation, then we will
+deadlock the log as we cannot take the locks needed to write back that item and
+move the tail of the log forwards to free up write grant space. Re-logging the
+locked items avoids this deadlock and guarantees that the log reservation we are
+making cannot self-deadlock.
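+
+In terms of the rolling transaction example shown earlier, this rule is what
+the xfs_trans_log_inode() call inside the loop satisfies: every roll re-logs
+the inode that is held locked across the whole chain.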
+
+If all rolling transactions obey this rule, then they can all make forwards
+progress independently because nothing will block the progress of the log
+tail moving forwards, hence ensuring that write grant space is always
+(eventually) made available to permanent transactions no matter how many times
+they roll.
+
+
+Re-logging Explained
+====================
+
+XFS allows multiple separate modifications to a single object to be carried in
+the log at any given time. This allows the log to avoid needing to flush each
+change to disk before recording a new change to the object. XFS does this via a
+method called "re-logging". Conceptually, this is quite simple - all it requires
+is that any new change to the object is recorded with a *new copy* of all the
+existing changes in the new transaction that is written to the log.
That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
of transactions, their contents and the log sequence number (LSN) of the
-transaction:
+transaction::
Transaction Contents LSN
A A X
@@ -39,16 +327,13 @@ transaction:
In other words, each time an object is relogged, the new transaction contains
the aggregation of all the previous changes currently held only in the log.
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward. This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
+This relogging technique allows objects to be moved forward in the log so that
+an object being relogged does not prevent the tail of the log from ever moving
+forward. This can be seen in the table above by the changing (increasing) LSN
+of each subsequent transaction, and it's the technique that allows us to
+implement long-running, multiple-commit permanent transactions.
-This relogging is also used to implement long-running, multiple-commit
-transactions. These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
+A typical example of a rolling transaction is the removal of extents from an
inode which can only be done at a rate of two extents per transaction because
of reservation size limitations. Hence a rolling extent removal transaction
keeps relogging the inode and btree buffers as they get modified in each
@@ -64,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get
dirtier as they get relogged, so each subsequent transaction is writing more
metadata into the log.
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
+It should now also be obvious how relogging and asynchronous transactions go
+hand in hand. That is, transactions don't get written to the physical journal
+until either a log buffer is filled (a log buffer can hold multiple
+transactions) or a synchronous operation forces the log buffers holding the
+transactions to disk. This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
The limitation on asynchronous transaction throughput is the number and size of
log buffers made available by the log manager. By default there are 8 log
@@ -85,7 +371,7 @@ IO permanently. Hence the XFS journalling subsystem can be considered to be IO
bound.
Delayed Logging: Concepts
--------------------------
+=========================
The key thing to note about the asynchronous logging combined with the
relogging technique XFS uses is that we can be relogging changed objects
@@ -154,9 +440,10 @@ The fundamental requirements for delayed logging in XFS are simple:
6. No performance regressions for synchronous transaction workloads.
Delayed Logging: Design
------------------------
+=======================
Storing Changes
+---------------
The problem with accumulating changes at a logical level (i.e. just using the
existing log item dirty region tracking) is that when it comes to writing the
@@ -194,30 +481,30 @@ asynchronous transactions to the log. The differences between the existing
formatting method and the delayed logging formatting can be seen in the
diagram below.
-Current format log vector:
+Current format log vector::
-Object +---------------------------------------------+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Object +---------------------------------------------+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
-After formatting:
+After formatting::
-Log Buffer +-V1-+-V2-+----V3----+
+ Log Buffer +-V1-+-V2-+----V3----+
-Delayed logging vector:
+Delayed logging vector::
-Object +---------------------------------------------+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Object +---------------------------------------------+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
-After formatting:
+After formatting::
-Memory Buffer +-V1-+-V2-+----V3----+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Memory Buffer +-V1-+-V2-+----V3----+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
The memory buffer and associated vector need to be passed as a single object,
but still need to be associated with the parent object so if the object is
@@ -242,6 +529,7 @@ relogged in memory.
Tracking Changes
+----------------
Now that we can record transactional changes in memory in a form that allows
them to be used without limitations, we need to be able to track and accumulate
@@ -263,14 +551,14 @@ Essentially, this shows that an item that is in the AIL can still be modified
and relogged, so any tracking must be separate to the AIL infrastructure. As
such, we cannot reuse the AIL list pointers for tracking committed items, nor
can we store state in any field that is protected by the AIL lock. Hence the
-committed item tracking needs it's own locks, lists and state fields in the log
+committed item tracking needs its own locks, lists and state fields in the log
item.
Similar to the AIL, tracking of committed items is done through a new list
called the Committed Item List (CIL). The list tracks log items that have been
committed and have formatted memory buffers attached to them. It tracks objects
in transaction commit order, so when an object is relogged it is removed from
-it's place in the list and re-inserted at the tail. This is entirely arbitrary
+its place in the list and re-inserted at the tail. This is entirely arbitrary
and done to make it easy for debugging - the last items in the list are the
ones that are most recently modified. Ordering of the CIL is not necessary for
transactional integrity (as discussed in the next section) so the ordering is
@@ -278,6 +566,7 @@ done for convenience/sanity of the developers.
Delayed Logging: Checkpoints
+----------------------------
When we have a log synchronisation event, commonly known as a "log force",
all the items in the CIL must be written into the log via the log buffers.
@@ -326,7 +615,7 @@ those changes into the current checkpoint context. We then initialise a new
context and attach that to the CIL for aggregation of new transactions.
This allows us to unlock the CIL immediately after transfer of all the
-committed items and effectively allow new transactions to be issued while we
+committed items and effectively allows new transactions to be issued while we
are formatting the checkpoint into the log. It also allows concurrent
checkpoints to be written into the log buffers in the case of log force heavy
workloads, just like the existing transaction commit code does. This, however,
@@ -341,7 +630,7 @@ Hence log vectors need to be able to be chained together to allow them to be
detached from the log items. That is, when the CIL is flushed the memory
buffer and log vector attached to each log item needs to be attached to the
checkpoint context so that the log item can be released. In diagrammatic form,
-the CIL would look like this before the flush:
+the CIL would look like this before the flush::
CIL Head
|
@@ -362,7 +651,7 @@ the CIL would look like this before the flush:
-> vector array
And after the flush the CIL head is empty, and the checkpoint context log
-vector list would look like:
+vector list would look like::
Checkpoint Context
|
@@ -411,6 +700,7 @@ compare" situation that can be done after a working and reviewed implementation
is in the dev tree....
Delayed Logging: Checkpoint Sequencing
+--------------------------------------
One of the key aspects of the XFS transaction subsystem is that it tags
committed transactions with the log sequence number of the transaction commit.
@@ -474,6 +764,7 @@ force the log at the LSN of that transaction) and so the higher level code
behaves the same regardless of whether delayed logging is being used or not.
Delayed Logging: Checkpoint Log Space Accounting
+------------------------------------------------
The big issue for a checkpoint transaction is the log space reservation for the
transaction. We don't know how big a checkpoint transaction is going to be
@@ -491,7 +782,7 @@ the size of the transaction and the number of regions being logged (the number
of log vectors in the transaction).
An example of the differences would be logging directory changes versus logging
-inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
+inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
there are lots of transactions that only contain an inode core and an inode log
format structure. That is, two vectors totaling roughly 150 bytes. If we modify
10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
@@ -565,6 +856,7 @@ which is once every 30s.
Delayed Logging: Log Item Pinning
+---------------------------------
Currently log items are pinned during transaction commit while the items are
still locked. This happens just after the items are formatted, though it could
@@ -592,9 +884,9 @@ pin the object the first time it is inserted into the CIL - if it is already in
the CIL during a transaction commit, then we do not pin it again. Because there
can be multiple outstanding checkpoint contexts, we can still see elevated pin
counts, but as each checkpoint completes the pin count will retain the correct
-value according to it's context.
+value according to its context.
-Just to make matters more slightly more complex, this checkpoint level context
+Just to make matters slightly more complex, this checkpoint level context
for the pin count means that the pinning of an item must take place under the
CIL commit/flush lock. If we pin the object outside this lock, we cannot
guarantee which context the pin count is associated with. This is because of
@@ -605,6 +897,7 @@ object, we have a race with CIL being flushed between the check and the pin
lock to guarantee that we pin the items correctly.
Delayed Logging: Concurrent Scalability
+---------------------------------------
A fundamental requirement for the CIL is that accesses through transaction
commits must scale to many concurrent commits. The current transaction commit
@@ -683,8 +976,9 @@ woken by the wrong event.
Lifecycle Changes
+-----------------
-The existing log item life cycle is as follows:
+The existing log item life cycle is as follows::
1. Transaction allocate
2. Transaction reserve
@@ -729,7 +1023,7 @@ at the same time. If the log item is in the AIL or between steps 6 and 7
and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
are entered and completed is the object considered clean.
-With delayed logging, there are new steps inserted into the life cycle:
+With delayed logging, there are new steps inserted into the life cycle::
1. Transaction allocate
2. Transaction reserve
diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
new file mode 100644
index 000000000000..32b6ac4ca9d6
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
@@ -0,0 +1,194 @@
+XFS Maintainer Entry Profile
+============================
+
+Overview
+--------
+XFS is a well known high-performance filesystem in the Linux kernel.
+The aim of this project is to provide and maintain a robust and
+performant filesystem.
+
+Patches are generally merged to the for-next branch of the appropriate
+git repository.
+After a testing period, the for-next branch is merged to the master
+branch.
+
+Kernel code is merged to the xfs-linux tree[0].
+Userspace code is merged to the xfsprogs tree[1].
+Test cases are merged to the xfstests tree[2].
+Ondisk format documentation is merged to the xfs-documentation tree[3].
+
+All patchsets involving XFS *must* be cc'd in their entirety to the mailing
+list linux-xfs@vger.kernel.org.
+
+Roles
+-----
+There are eight key roles in the XFS project.
+A person can take on multiple roles, and a role can be filled by
+multiple people.
+Anyone taking on a role is advised to check in with themselves and
+others on a regular basis about burnout.
+
+- **Outside Contributor**: Anyone who sends a patch but is not involved
+ in the XFS project on a regular basis.
+ These folks are usually people who work on other filesystems or
+ elsewhere in the kernel community.
+
+- **Developer**: Someone who is familiar with the XFS codebase enough to
+ write new code, documentation, and tests.
+
+ Developers can often be found in the IRC channel mentioned by the ``C:``
+ entry in the kernel MAINTAINERS file.
+
+- **Senior Developer**: A developer who is very familiar with at least
+ some part of the XFS codebase and/or other subsystems in the kernel.
+ These people collectively decide the long term goals of the project
+ and nudge the community in that direction.
+ They should help prioritize development and review work for each release
+ cycle.
+
+ Senior developers tend to be more active participants in the IRC channel.
+
+- **Reviewer**: Someone (most likely also a developer) who reads code
+ submissions to decide:
+
+ 0. Is the idea behind the contribution sound?
+ 1. Does the idea fit the goals of the project?
+ 2. Is the contribution designed correctly?
+ 3. Is the contribution polished?
+ 4. Can the contribution be tested effectively?
+
+ Reviewers should identify themselves with an ``R:`` entry in the kernel
+ and fstests MAINTAINERS files.
+
+- **Testing Lead**: This person is responsible for setting the test
+ coverage goals of the project, negotiating with developers to decide
+ on new tests for new features, and making sure that developers and
+ release managers execute on the testing.
+
+ The testing lead should identify themselves with an ``M:`` entry in
+ the XFS section of the fstests MAINTAINERS file.
+
+- **Bug Triager**: Someone who examines incoming bug reports in just
+ enough detail to identify the person to whom the report should be
+ forwarded.
+
+ The bug triagers should identify themselves with a ``B:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Release Manager**: This person merges reviewed patchsets into an
+ integration branch, tests the result locally, pushes the branch to a
+ public git repository, and sends pull requests further upstream.
+ The release manager is not expected to work on new feature patchsets.
+ If a developer and a reviewer fail to reach a resolution on some point,
+ the release manager must have the ability to intervene to try to drive a
+ resolution.
+
+ The release manager should identify themselves with an ``M:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Community Manager**: This person calls and moderates meetings of as many
+  XFS participants as they can get when mailing list discussions prove
+  insufficient for collective decision making.
+ They may also serve as liaison between managers of the organizations
+ sponsoring work on any part of XFS.
+
+- **LTS Maintainer**: Someone who backports and tests bug fixes from
+  upstream to the LTS kernels.
+ There tend to be six separate LTS trees at any given time.
+
+ The maintainer for a given LTS release should identify themselves with an
+ ``M:`` entry in the MAINTAINERS file for that LTS tree.
+ Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that
+ same file.
+
+Submission Checklist Addendum
+-----------------------------
+Please follow these additional rules when submitting to XFS:
+
+- Patches affecting only the filesystem itself should be based against
+ the latest -rc or the for-next branch.
+ These patches will be merged back to the for-next branch.
+
+- Authors of patches touching other subsystems need to coordinate with
+ the maintainers of XFS and the relevant subsystems to decide how to
+ proceed with a merge.
+
+- Any patchset changing XFS should be cc'd in its entirety to linux-xfs.
+ Do not send partial patchsets; that makes analysis of the broader
+ context of the changes unnecessarily difficult.
+
+- Anyone making kernel changes that have corresponding changes to the
+ userspace utilities should send the userspace changes as separate
+ patchsets immediately after the kernel patchsets.
+
+- Authors of bug fix patches are expected to use fstests[2] to perform
+ an A/B test of the patch to determine that there are no regressions.
+ When possible, a new regression test case should be written for
+ fstests.
+
+- Authors of new feature patchsets must ensure that fstests will have
+ appropriate functional and input corner-case test cases for the new
+ feature.
+
+- When implementing a new feature, it is strongly suggested that the
+ developers write a design document to answer the following questions:
+
+ * **What** problem is this trying to solve?
+
+ * **Who** will benefit from this solution, and **where** will they
+ access it?
+
+ * **How** will this new feature work? This should touch on major data
+ structures and algorithms supporting the solution at a higher level
+ than code comments.
+
+ * **What** userspace interfaces are necessary to build off of the new
+ features?
+
+ * **How** will this work be tested to ensure that it solves the
+ problems laid out in the design document without causing new
+ problems?
+
+ The design document should be committed in the kernel documentation
+ directory.
+ It may be omitted if the feature is already well known to the
+ community.
+
+- Patchsets for the new tests should be submitted as separate patchsets
+ immediately after the kernel and userspace code patchsets.
+
+- Changes to the on-disk format of XFS must be described in the ondisk
+ format document[3] and submitted as a patchset after the fstests
+ patchsets.
+
+- Patchsets implementing bug fixes and further code cleanups should put
+ the bug fixes at the beginning of the series to ease backporting.
+
+Key Release Cycle Dates
+-----------------------
+Bug fixes may be sent at any time, though the release manager may decide to
+defer a patch when the next merge window is close.
+
+Code submissions targeting the next merge window should be sent between
+-rc1 and -rc6.
+This gives the community time to review the changes, to suggest other changes,
+and for the author to retest those changes.
+
+Code submissions also requiring changes to fs/iomap and targeting the
+next merge window should be sent between -rc1 and -rc4.
+This allows the broader kernel community adequate time to test the
+infrastructure changes.
+
+Review Cadence
+--------------
+In general, please wait at least one week before pinging for feedback.
+To find reviewers, either consult the MAINTAINERS file, or ask
+developers that have Reviewed-by tags for XFS changes to take a look and
+offer their opinion.
+
+References
+----------
+| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/
+| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/
+| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
+| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
new file mode 100644
index 000000000000..352516feef6f
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -0,0 +1,5315 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_online_fsck_design:
+
+..
+ Mapping of heading styles within this document:
+ Heading 1 uses "====" above and below
+ Heading 2 uses "===="
+ Heading 3 uses "----"
+ Heading 4 uses "````"
+ Heading 5 uses "^^^^"
+ Heading 6 uses "~~~~"
+ Heading 7 uses "...."
+
+ Sections are manually numbered because apparently that's what everyone
+ does in the kernel.
+
+======================
+XFS Online Fsck Design
+======================
+
+This document captures the design of the online filesystem check feature for
+XFS.
+The purpose of this document is threefold:
+
+- To help kernel distributors understand exactly what the XFS online fsck
+ feature is, and issues about which they should be aware.
+
+- To help people reading the code to familiarize themselves with the relevant
+ concepts and design points before they start digging into the code.
+
+- To help developers maintaining the system by capturing the reasons
+ supporting higher level decision making.
+
+As the online fsck code is merged, the links in this document to topic branches
+will be replaced with links to code.
+
+This document is licensed under the terms of the GNU Public License, v2.
+The primary author is Darrick J. Wong.
+
+This design document is split into seven parts.
+Part 1 defines what fsck tools are and the motivations for writing a new one.
+Parts 2 and 3 present a high level overview of how the online fsck process
+works and how it is tested to ensure correct functionality.
+Part 4 discusses the user interface and the intended usage modes of the new
+program.
+Parts 5 and 6 show off the high level components and how they fit together, and
+then present case studies of how each repair function actually works.
+Part 7 sums up what has been discussed so far and speculates about what else
+might be built atop online fsck.
+
+.. contents:: Table of Contents
+ :local:
+
+1. What is a Filesystem Check?
+==============================
+
+A Unix filesystem has four main responsibilities:
+
+- Provide a hierarchy of names through which application programs can associate
+ arbitrary blobs of data for any length of time,
+
+- Virtualize physical storage media across those names,
+
+- Retrieve the named data blobs at any time, and
+
+- Examine resource usage.
+
+Metadata directly supporting these functions (e.g. files, directories, space
+mappings) are sometimes called primary metadata.
+Secondary metadata (e.g. reverse mapping and directory parent pointers) support
+operations internal to the filesystem, such as internal consistency checking
+and reorganization.
+Summary metadata, as the name implies, condense information contained in
+primary metadata for performance reasons.
+
+The filesystem check (fsck) tool examines all the metadata in a filesystem
+to look for errors.
+In addition to looking for obvious metadata corruptions, fsck also
+cross-references different types of metadata records with each other to look
+for inconsistencies.
+People do not like losing data, so most fsck tools also contain some ability
+to correct any problems found.
+As a word of caution -- the primary goal of most Linux fsck tools is to restore
+the filesystem metadata to a consistent state, not to maximize the data
+recovered.
+That precedent will not be challenged here.
+
+Filesystems of the 20th century generally lacked any redundancy in the ondisk
+format, which means that fsck can only respond to errors by erasing files until
+errors are no longer detected.
+More recent filesystem designs contain enough redundancy in their metadata that
+it is now possible to regenerate data structures when non-catastrophic errors
+occur; this capability aids both strategies.
+
++--------------------------------------------------------------------------+
+| **Note**: |
++--------------------------------------------------------------------------+
+| System administrators avoid data loss by increasing the number of |
+| separate storage systems through the creation of backups; and they avoid |
+| downtime by increasing the redundancy of each storage system through the |
+| creation of RAID arrays. |
+| fsck tools address only the first problem. |
++--------------------------------------------------------------------------+
+
+TLDR; Show Me the Code!
+-----------------------
+
+Code is posted to the kernel.org git trees as follows:
+`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
+`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
+`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Each kernel patchset adding an online repair function will use the same branch
+name across the kernel, xfsprogs, and fstests git repos.
+
+Existing Tools
+--------------
+
+The online fsck tool described here will be the third tool in the history of
+XFS (on Linux) to check and repair filesystems.
+Two programs precede it:
+
+The first program, ``xfs_check``, was created as part of the XFS debugger
+(``xfs_db``) and can only be used with unmounted filesystems.
+It walks all metadata in the filesystem looking for inconsistencies, though
+it lacks any ability to repair what it finds.
+Due to its high memory requirements and inability to repair things, this
+program is now deprecated and will not be discussed further.
+
+The second program, ``xfs_repair``, was created to be faster and more robust
+than the first program.
+Like its predecessor, it can only be used with unmounted filesystems.
+It uses extent-based in-memory data structures to reduce memory consumption,
+and tries to schedule readahead I/O appropriately to reduce I/O waiting time
+while it scans the metadata of the entire filesystem.
+The most important feature of this tool is its ability to respond to
+inconsistencies in file metadata and the directory tree by erasing things as
+needed to eliminate problems.
+Space usage metadata are rebuilt from the observed file metadata.
+
+Problem Statement
+-----------------
+
+The current XFS tools leave several problems unsolved:
+
+1. **User programs** suddenly **lose access** to the filesystem when unexpected
+ shutdowns occur as a result of silent corruptions in the metadata.
+ These occur **unpredictably** and often without warning.
+
+2. **Users** experience a **total loss of service** during the recovery period
+ after an **unexpected shutdown** occurs.
+
+3. **Users** experience a **total loss of service** if the filesystem is taken
+ offline to **look for problems** proactively.
+
+4. **Data owners** cannot **check the integrity** of their stored data without
+ reading all of it.
+ This may expose them to substantial billing costs when a linear media scan
+ performed by the storage system administrator might suffice.
+
+5. **System administrators** cannot **schedule** a maintenance window to deal
+ with corruptions if they **lack the means** to assess filesystem health
+ while the filesystem is online.
+
+6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
+ health when doing so requires **manual intervention** and downtime.
+
+7. **Users** can be tricked into **doing things they do not desire** when
+ malicious actors **exploit quirks of Unicode** to place misleading names
+ in directories.
+
+Given this definition of the problems to be solved and the actors who would
+benefit, the proposed solution is a third fsck tool that acts on a running
+filesystem.
+
+This new third program has three components: an in-kernel facility to check
+metadata, an in-kernel facility to repair metadata, and a userspace driver
+program to drive fsck activity on a live filesystem.
+``xfs_scrub`` is the name of the driver program.
+The rest of this document presents the goals and use cases of the new fsck
+tool, describes its major design points in connection to those goals, and
+discusses the similarities and differences with existing tools.
+
++----------------------------------------------------------------------+
+| **Note**:                                                            |
++----------------------------------------------------------------------+
+| Throughout this document, the existing offline fsck tool can also be |
+| referred to by its current name "``xfs_repair``".                    |
+| The userspace driver program for the new online fsck tool can be     |
+| referred to as "``xfs_scrub``".                                      |
+| The kernel portion of online fsck that validates metadata is called  |
+| "online scrub", and the portion of the kernel that fixes metadata is |
+| called "online repair".                                              |
++----------------------------------------------------------------------+
+
+The naming hierarchy is broken up into objects known as directories and
+files, and the physical space is split into pieces known as allocation
+groups.
+Sharding enables better performance on highly parallel systems and helps to
+contain the damage when corruptions occur.
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks and
+repairs on a subset of the filesystem.
+
+While a check or repair runs against one shard, the other parts of the
+filesystem continue processing IO requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata objects,
+online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third components are in the kernel, which implements functions
+to check and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**: |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work |
+| item" to "scrub item". |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything that
+the offline fsck program can handle.
+However, online fsck cannot be running 100% of the time, which means that
+latent errors may creep in after a scrub completes.
+If these errors cause the next mount to fail, offline fsck is the only
+solution.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck is not a complete replacement for offline fsck,
+and a complete run of online fsck may take longer than offline fsck.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+ discover the online fsck capabilities of the kernel, and open the
+ underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all quota
+ files.
+ Each metadata structure is scheduled as a separate scrub item.
+ If corruption is found in the inode header or inode btree and ``xfs_scrub``
+ is permitted to perform repairs, then those scrub items are repaired to
+ prepare for phase 3.
+ Repairs are implemented by using the information in the scrub item to
+ resubmit the kernel scrub call with the repair flag enabled; this is
+ discussed in the next section.
+ Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+ Each metadata structure is also scheduled as a separate scrub item.
+ If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
+ and there were no problems detected during phase 2, then those scrub items
+ are repaired immediately.
+ Optimizations, deferred repairs, and unsuccessful repairs are deferred to
+ phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+ phase, if the caller permits them.
+ Before starting repairs, the summary counters are checked and any necessary
+ repairs are performed so that subsequent repairs will not fail the resource
+ reservation step due to wildly incorrect summary counters.
+ Unsuccessful repairs are requeued as long as forward progress on repairs is
+ made somewhere in the filesystem.
+ Free space in the filesystem is trimmed at the end of phase 4 if the
+ filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+ must be correct.
+ Summary counters such as the free space counts and quota resource counts
+ are checked and corrected.
+ Directory entry names and extended attribute names are checked for
+ suspicious entries such as control characters or confusing Unicode sequences
+ appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+ file extents in the filesystem.
+   The ability to use hardware-assisted data file integrity checking is new
+   to online fsck; neither of the previous tools has this capability.
+ If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+   space usage and file counts.
+
+This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
+later in this document.
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item:
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+ optimization; and for values that are directly controlled by the system
+ administrator but look suspicious.
+   If the item is not corrupt or does not need optimization, resources are
+   released and the positive scan results are returned to userspace.
+ If the item is corrupt or could be optimized but the caller does not permit
+ this, resources are released and the negative scan results are returned to
+ userspace.
+ Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+   Repair functions generally choose to rebuild a structure from other
+   metadata rather than try to salvage the existing structure.
+ If the repair fails, the scan results from the first step are returned to
+ userspace.
+ Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+ item to assess the efficacy of the repairs.
+ The results of the reassessment are returned to userspace.
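+
+A minimal sketch of driving this loop from userspace, assuming the scrub
+ioctl definitions that xfsprogs installs as ``xfs/xfs.h``, might look like
+the following; ``xfs_scrub`` wraps this pattern in its own scheduling and
+reporting machinery:
+
+.. code-block:: c
+
+    /* Check one AG btree; ask the kernel to repair it only if needed. */
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <xfs/xfs.h>
+
+    static int scrub_and_maybe_repair(int fs_fd, __u32 type, __u32 agno)
+    {
+        struct xfs_scrub_metadata sm;
+
+        memset(&sm, 0, sizeof(sm));
+        sm.sm_type = type;            /* e.g. XFS_SCRUB_TYPE_BNOBT */
+        sm.sm_agno = agno;
+
+        /* Step 1: check only. */
+        if (ioctl(fs_fd, XFS_IOC_SCRUB_METADATA, &sm))
+            return -1;
+        if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+                             XFS_SCRUB_OFLAG_XCORRUPT |
+                             XFS_SCRUB_OFLAG_PREEN)))
+            return 0;                 /* clean; nothing more to do */
+
+        /*
+         * Steps 2 and 3: resubmit with the repair flag set.  The kernel
+         * rebuilds the structure, re-runs the check, and returns the
+         * final health state in sm_flags.
+         */
+        sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
+        return ioctl(fs_fd, XFS_IOC_SCRUB_METADATA, &sm);
+    }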
+
+Classification of Metadata
+--------------------------
+
+Each type of metadata object (and therefore each type of scrub item) is
+classified as follows:
+
+Primary Metadata
+````````````````
+
+Metadata structures in this category should be most familiar to filesystem
+users either because they are directly created by the user or because they
+index objects created by the user.
+Most filesystem objects fall into this class:
+
+- Free space and reference count information
+
+- Inode records and indexes
+
+- Storage mapping information for file data
+
+- Directories
+
+- Extended attributes
+
+- Symbolic links
+
+- Quota limits
+
+Scrub obeys the same rules as regular filesystem accesses for resource and lock
+acquisition.
+
+Primary metadata objects are the simplest for scrub to process.
+The principal filesystem object (either an allocation group or an inode) that
+owns the item being scrubbed is locked to guard against concurrent updates.
+The check function examines every record associated with the type for obvious
+errors and cross-references healthy records against other metadata to look for
+inconsistencies.
+Repairs for this class of scrub item are simple, since the repair function
+starts by holding all the resources acquired in the previous step.
+The repair function scans available metadata as needed to record all the
+observations needed to complete the structure.
+Next, it stages the observations in a new ondisk structure and commits it
+atomically to complete the repair.
+Finally, the storage used by the old data structure is carefully reaped.
+
+Because ``xfs_scrub`` locks a primary object for the duration of the repair,
+this is effectively an offline repair operation performed on a subset of the
+filesystem.
+This minimizes the complexity of the repair code because it is not necessary to
+handle concurrent updates from other threads, nor is it necessary to access
+any other part of the filesystem.
+As a result, indexed structures can be rebuilt very quickly, and programs
+trying to access the damaged structure will be blocked until repairs complete.
+The only infrastructure needed by the repair code is the staging area for
+observations and a means to write new structures to disk.
+Despite these limitations, the advantage that online repair holds is clear:
+targeted work on individual shards of the filesystem avoids total loss of
+service.
+
+This mechanism is described in section 2.1 ("Off-Line Algorithm") of
+V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
+Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
+*Extending Database Technology*, pp. 293-309, 1992.
+
+Most primary metadata repair functions stage their intermediate results in an
+in-memory array prior to formatting the new ondisk structure, which is very
+similar to the list-based algorithm discussed in section 2.3 ("List-Based
+Algorithms") of Srinivasan.
+However, any data structure builder that maintains a resource lock for the
+duration of the repair is *always* an offline algorithm.
+
+.. _secondary_metadata:
+
+Secondary Metadata
+``````````````````
+
+Metadata structures in this category reflect records found in primary metadata,
+but are only needed for online fsck or for reorganization of the filesystem.
+
+Secondary metadata include:
+
+- Reverse mapping information
+
+- Directory parent pointers
+
+This class of metadata is difficult for scrub to process because scrub attaches
+to the secondary object but needs to check primary metadata, which runs counter
+to the usual order of resource acquisition.
+Frequently, this means that full filesystem scans are necessary to rebuild
+the metadata.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index will either have the same format as the ondisk structure or a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+While the filesystem scan is in progress, the repair function hooks the
+filesystem so that it can apply pending filesystem updates to the staging
+information.
+Once the scan is done, the owning object is re-locked, the live data is used to
+write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage used by the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems, but
+comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
+
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data to
+build the new structure as quickly as possible; and an auxiliary structure that
+captures all updates that would be committed to the index by other threads were
+the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
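+
+The elision rule can be stated compactly; the following sketch uses
+hypothetical types and helper names purely for illustration and is not
+XFS code:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    struct side_file {
+        uint64_t builder_cursor;  /* scan progress through record IDs */
+        /* ... log of pending updates ... */
+    };
+
+    /* Hypothetical helper that queues an update for post-scan replay. */
+    void side_file_append(struct side_file *sf, uint64_t rec_id,
+                          const void *update);
+
+    static void side_file_log_update(struct side_file *sf,
+                                     uint64_t rec_id, const void *update)
+    {
+        /*
+         * The builder has not yet scanned records beyond the cursor, so
+         * it will observe the updated state directly; recording the
+         * update here would duplicate work.
+         */
+        if (rec_id > sf->builder_cursor)
+            return;
+
+        side_file_append(sf, rec_id, update);
+    }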
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index
+while repair is running.
+The complexity of such an approach would be very high and perhaps more
+appropriate to building *new* indices.
+
+**Future Work Question**: Can the full scan and live update code used to
+facilitate a repair also be used to implement a comprehensive check?
+
+*Answer*: In theory, yes. Check would be much stronger if each scrub function
+employed these live scans to build a shadow copy of the metadata and then
+compared the shadow records to the ondisk records.
+However, doing that is a fair amount more work than what the checking
+functions do now, which would increase the runtime of those scrub functions.
+The live scans and hooks were also developed much later than the checking
+functions.
+
+Summary Information
+```````````````````
+
+Metadata structures in this last category summarize the contents of primary
+metadata records.
+These are often used to speed up resource usage queries, and are many times
+smaller than the primary metadata which they represent.
+
+Examples of summary information include:
+
+- Summary counts of free space and inodes
+
+- File link counts from directories
+
+- Quota resource usage counts
+
+Check and repair require full filesystem scans, but resource and lock
+acquisition follow the same paths as regular filesystem accesses.
+
+The superblock summary counters have special requirements due to the underlying
+implementation of the incore counters, and will be treated separately.
+Check and repair of the other types of summary counters (quota resource counts
+and file link counts) employ the same filesystem scanning and hooking
+techniques as outlined above, but because the underlying data are sets of
+integer counters, the staging data need not be a fully functional mirror of the
+ondisk structure.
+
+Inspiration for quota and file link count repair strategies was drawn from
+sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
+Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
+and Their Indexes"
+<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
+
+Since quotas are non-negative integer counts of resource usage, online
+quotacheck can use the incremental view deltas described in section 2.14 to
+track pending changes to the block and inode usage counts in each transaction,
+and commit those changes to a dquot side file when the transaction commits.
+Delta tracking is necessary for dquots because the index builder scans inodes,
+whereas the data structure being rebuilt is an index of dquots.
+Link count checking combines the view deltas and commit step into one because
+it sets attributes of the objects being scanned instead of writing them to a
+separate data structure.
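+
+A hedged sketch of such an incremental view delta, using illustrative names
+rather than the kernel's actual structures, might look like this:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    /*
+     * One record per (dquot, transaction) accumulating pending resource
+     * usage changes; the deltas are applied to the shadow dquot
+     * information when the transaction commits.
+     */
+    struct quotacheck_delta {
+        uint32_t dq_id;           /* user/group/project ID */
+        int64_t  icount_delta;    /* inode count change */
+        int64_t  bcount_delta;    /* data block count change */
+        int64_t  rtbcount_delta;  /* realtime block count change */
+    };
+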
+Each online fsck function will be discussed as case studies later in this
+document.
+
+Risk Management
+---------------
+
+During the development of online fsck, several risk factors were identified
+that may make the feature unsuitable for certain distributors and users.
+Steps can be taken to mitigate or eliminate those risks, though at a cost to
+functionality.
+
+- **Decreased performance**: Adding metadata indices to the filesystem
+ increases the time cost of persisting changes to disk, and the reverse space
+ mapping and directory parent pointers are no exception.
+ System administrators who require the maximum performance can disable the
+ reverse mapping features at format time, though this choice dramatically
+ reduces the ability of online fsck to find inconsistencies and repair them.
+
+- **Incorrect repairs**: As with all software, there might be defects in the
+ software that result in incorrect repairs being written to the filesystem.
+ Systematic fuzz testing (detailed in the next section) is employed by the
+ authors to find bugs early, but it might not catch everything.
+ The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
+ and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
+ accept this risk.
+ The xfsprogs build system has a configure option (``--enable-scrub=no``) that
+ disables building of the ``xfs_scrub`` binary, though this is not a risk
+ mitigation if the kernel functionality remains enabled.
+
+- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
+ repairable.
+ If the keyspaces of several metadata indices overlap in some manner but a
+ coherent narrative cannot be formed from records collected, then the repair
+ fails.
+ To reduce the chance that a repair will fail with a dirty transaction and
+ render the filesystem unusable, the online repair functions have been
+ designed to stage and validate all new records before committing the new
+ structure.
+
+- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
+ devices, opening files by handle, ignoring Unix discretionary access control,
+ and the ability to perform administrative changes.
+ Running this automatically in the background scares people, so the systemd
+ background service is configured to run with only the privileges required.
+ Obviously, this cannot address certain problems like the kernel crashing or
+ deadlocking, but it should be sufficient to prevent the scrub process from
+ escaping and reconfiguring the system.
+ The cron job does not have this protection.
+
+- **Fuzz Kiddiez**: There are many people now who seem to think that running
+ automated fuzz testing of ondisk artifacts to find mischievous behavior and
+ spraying exploit code onto the public mailing list for instant zero-day
+ disclosure is somehow of some social benefit.
+ In the view of this author, the benefit is realized only when the fuzz
+ operators help to **fix** the flaws, but this opinion apparently is not
+ widely shared among security "researchers".
+ The XFS maintainers' continuing ability to manage these events presents an
+ ongoing risk to the stability of the development process.
+ Automated testing should front-load some of the risk while the feature is
+ considered EXPERIMENTAL.
+
+Many of these risks are inherent to software programming.
+Despite this, it is hoped that this new functionality will prove useful in
+reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+the community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair as compared to
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests was created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were written to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+ * Write garbage to it
+
+ * Test the reactions of:
+
+ 1. The kernel verifiers to stop obviously bad metadata
+ 2. Offline repair (``xfs_repair``) to detect and fix
+ 3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+The testing plan for online fsck includes extending the existing fs testing
+infrastructure to provide a much more powerful facility: targeted fuzz testing
+of every metadata field of every metadata object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+ * For each record inside that metadata object...
+
+ * For each field inside that record...
+
+ * For each conceivable type of transformation that can be applied to a bit field...
+
+ 1. Clear all bits
+ 2. Set all bits
+ 3. Toggle the most significant bit
+ 4. Toggle the middle bit
+ 5. Toggle the least significant bit
+ 6. Add a small quantity
+ 7. Subtract a small quantity
+ 8. Randomize the contents
+
+ * ...test the reactions of:
+
+ 1. The kernel verifiers to stop obviously bad metadata
+ 2. Offline checking (``xfs_repair -n``)
+ 3. Offline repair (``xfs_repair``)
+ 4. Online checking (``xfs_scrub -n``)
+ 5. Online repair (``xfs_scrub``)
+ 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A requirement unique to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+ while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+ while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+ filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+ force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+ freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+ remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
+
+4. User Interface
+=================
+
+The primary user of online fsck is the system administrator, just like offline
+repair.
+Online fsck presents two modes of operation to administrators:
+A foreground CLI process for online fsck on demand, and a background service
+that performs autonomous checking and repair.
+
+Checking on Demand
+------------------
+
+For administrators who want the absolute freshest information about the
+metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
+a command line.
+The program checks every piece of metadata in the filesystem while the
+administrator waits for the results to be reported, just like the existing
+``xfs_repair`` tool.
+Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
+option to increase the verbosity of the information reported.
+
+A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
+correction capabilities of the hardware to check data file contents.
+The media scan is not enabled by default because it may dramatically increase
+program runtime and consume a lot of bandwidth on older storage hardware.
+
+The output of a foreground invocation is captured in the system log.
+
+The ``xfs_scrub_all`` program walks the list of mounted filesystems and
+initiates ``xfs_scrub`` for each of them in parallel.
+It serializes scans for any filesystems that resolve to the same top level
+kernel block device to prevent resource overconsumption.
+
+Background Service
+------------------
+
+To reduce the workload of system administrators, the ``xfs_scrub`` package
+provides a suite of `systemd <https://systemd.io/>`_ timers and services that
+run online fsck automatically on weekends by default.
+The background service configures scrub to run with as little privilege as
+possible, the lowest CPU and IO priority, and in a CPU-constrained
+single-threaded mode.
+This can be tuned by the systemd administrator at any time to suit the latency
+and throughput requirements of customer workloads.
+
+The output of the background service is also captured in the system log.
+If desired, reports of failures (either due to inconsistencies or mere runtime
+errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
+variable in the following service files:
+
+* ``xfs_scrub_fail@.service``
+* ``xfs_scrub_media_fail@.service``
+* ``xfs_scrub_all_fail.service``
+
+The decision to enable the background scan is left to the system administrator.
+This can be done by enabling either of the following services:
+
+* ``xfs_scrub_all.timer`` on systemd systems
+* ``xfs_scrub_all.cron`` on non-systemd systems
+
+This automatic weekly scan is configured out of the box to perform an
+additional media scan of all file data once per month.
+This is less foolproof than, say, storing file data block checksums, but much
+more performant if application software provides its own integrity checking,
+redundancy can be provided elsewhere above the filesystem, or the storage
+device's integrity guarantees are deemed sufficient.
+
+The systemd unit file definitions have been subjected to a security audit
+(as of systemd 249) to ensure that the xfs_scrub processes have as little
+access to the rest of the system as possible.
+This was performed via ``systemd-analyze security``, after which privileges
+were restricted to the minimum required, sandboxing was set up to the maximal
+extent possible with system call filtering, and access to the filesystem tree
+was restricted to the minimum needed to start the program and access the
+filesystem being scanned.
+The service definition files restrict CPU usage to 80% of one CPU core, and
+apply as low an IO and CPU scheduling priority as possible.
+This measure was taken to minimize delays in the rest of the filesystem.
+No such hardening has been performed for the cron job.
+
+Proposed patchset:
+`Enabling the xfs_scrub background service
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, or whenever
+inconsistencies are detected in the filesystem metadata during regular
+operations.
+System administrators should use the ``health`` command of ``xfs_spaceman``
+to retrieve this information in a human-readable format.
+If problems have been observed, the administrator can schedule a reduced
+service window to run the online repair tool to correct the problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Future Work Question**: Should the health reporting integrate with the new
+inotify fs error notification system?
+Would it be helpful for sysadmins to have a daemon to listen for corruption
+notifications and initiate a repair?
+
+*Answer*: These questions remain unanswered, but should be a part of the
+conversation with early adopters and potential downstream users of XFS.
+
+Proposed patchsets include
+`wiring up health reports to correction returns
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
+and
+`preservation of sickness info during memory reclaim
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
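+
+The following composite illustrates the common v5 header fields; each
+structure (AGF, btree block, directory block, and so on) lays these fields
+out differently, so this is an illustration rather than a literal ondisk
+definition:
+
+.. code-block:: c
+
+    struct xfs_v5_header_example {
+        __be32 magic;  /* structure type identifier */
+        __be32 crc;    /* checksum of this block */
+        __be64 blkno;  /* ondisk address of this block */
+        __be64 lsn;    /* log sequence number of the last update */
+        uuid_t uuid;   /* filesystem UUID */
+        __be64 owner;  /* AG number or inumber of the owner */
+    };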
+
+Whenever a file system operation modifies a block, the change is submitted
+to the log as part of a transaction.
+The log then processes these transactions, marking them done once they are
+safely persisted to storage.
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other discrepancies that can
+be introduced between the computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
+
+For more information, please see
+Documentation/filesystems/xfs/xfs-self-describing-metadata.rst.
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or using more space-efficient encoding techniques.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remain unchanged.
+Like any system that improves redundancy, the reverse-mapping feature increases
+overhead costs for space mapping activities.
+However, it has two critical advantages: first, the reverse index is key to
+enabling online fsck and other requested functionality such as free space
+defragmentation, better media failure reporting, and filesystem shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication because the filesystem requires real
+redundancy.
+
++--------------------------------------------------------------------------+
+| **Sidebar**: |
++--------------------------------------------------------------------------+
+| A criticism of adding the secondary index is that it does nothing to |
+| improve the robustness of user data storage itself. |
+| This is a valid point, but adding a new index for file data block |
+| checksums increases write amplification by turning data overwrites into |
+| copy-writes, which age the filesystem prematurely. |
+| In keeping with thirty years of precedent, users who want file data |
+| integrity can supply as powerful a solution as they require. |
+| As for metadata, the complexity of adding a new secondary index of space |
+| usage is much less than adding volume management and storage device |
+| mirroring to XFS itself. |
+| Perfection of RAID and volume management are best left to existing |
+| layers in the kernel. |
++--------------------------------------------------------------------------+
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+ struct xfs_rmap_irec {
+ xfs_agblock_t rm_startblock; /* extent start block */
+ xfs_extlen_t rm_blockcount; /* extent length */
+ uint64_t rm_owner; /* extent owner */
+ uint64_t rm_offset; /* offset within the owner */
+ unsigned int rm_flags; /* state flags */
+ };
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent? A file mapping btree extent? Or an
+unwritten data extent?
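+
+As an illustrative example, assuming the ``rm_flags`` values defined
+alongside this structure in the kernel, a record for eight unwritten blocks
+mapped at file block offset 16 into the data fork of inode 128 might be
+encoded as follows:
+
+.. code-block:: c
+
+    struct xfs_rmap_irec rmap = {
+        .rm_startblock = 2048,  /* AG block of the extent */
+        .rm_blockcount = 8,     /* extent length in blocks */
+        .rm_owner      = 128,   /* inumber of the owning file */
+        .rm_offset     = 16,    /* file block offset */
+        .rm_flags      = XFS_RMAP_UNWRITTEN,
+    };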
+
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+ marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
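+
+A condensed sketch of those cross-references, with hypothetical helper names
+modeled loosely on the functions in ``fs/xfs/scrub/``, might look like this;
+each helper flags the scrub context if the other index disagrees with the
+mapping being checked:
+
+.. code-block:: c
+
+    static void example_xref_data_extent(struct xfs_scrub *sc,
+                                         xfs_agblock_t agbno,
+                                         xfs_extlen_t len,
+                                         bool is_shared)
+    {
+        example_xref_is_used_space(sc, agbno, len);       /* free space */
+        example_xref_is_not_inode_chunk(sc, agbno, len);  /* inode index */
+        if (!is_shared)
+            example_xref_is_not_shared(sc, agbno, len);   /* refcount */
+        example_xref_has_rmap(sc, agbno, len);            /* rmap */
+    }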
+
+There are several observations to make about reverse mapping indices:
+
+1. Reverse mappings can provide a positive affirmation of correctness if any of
+ the above primary metadata are in doubt.
+ The checking code for most primary metadata follows a path similar to the
+ one outlined above.
+
+2. Proving the consistency of secondary metadata with the primary metadata is
+ difficult because that requires a full scan of all primary space metadata,
+ which is very time intensive.
+ For example, checking a reverse mapping record for a file extent mapping
+ btree block requires locking the file and searching the entire btree to
+ confirm the block.
+ Instead, scrub relies on rigorous cross-referencing during the primary space
+ mapping structure checks.
+
+3. Consistency scans must use non-blocking lock acquisition primitives if the
+ required locking order is not the same order used by regular filesystem
+ operations.
+ For example, if the filesystem normally takes a file ILOCK before taking
+ the AGF buffer lock but scrub wants to take a file ILOCK while holding
+ an AGF buffer lock, scrub cannot block on that second acquisition.
+ This means that forward progress during this part of a scan of the reverse
+ mapping data cannot be guaranteed if system load is heavy.
+
+In summary, reverse mappings play a key role in reconstruction of primary
+metadata.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+five decisions about the health of a metadata structure:
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+ (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+ possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+ metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+ by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
+
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS is the set of metadata
+verifiers built into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+ This assumes that metadata blocks only have one owner, which is always true
+ in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+ scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
+
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata force the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
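+
+A read verifier that answers these questions has roughly the following
+shape; ``XFS_EXAMPLE_CRC_OFF`` and ``xfs_example_verify_struct()`` are
+hypothetical stand-ins for a structure's checksum offset and its internal
+consistency checker:
+
+.. code-block:: c
+
+    static void xfs_example_read_verify(struct xfs_buf *bp)
+    {
+        xfs_failaddr_t fa;
+
+        /* Detect torn and lost writes via the block checksum. */
+        if (!xfs_buf_verify_cksum(bp, XFS_EXAMPLE_CRC_OFF)) {
+            xfs_verifier_error(bp, -EFSBADCRC, __this_address);
+            return;
+        }
+
+        /* Check the magic, UUID, owner, and ondisk address. */
+        fa = xfs_example_verify_struct(bp);
+        if (fa)
+            xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+    }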
+
+Internal Consistency Checks
+```````````````````````````
+
+After the buffer cache, the next level of metadata protection is the internal
+record verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+These higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+ the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check for all possible problems.
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this, validation work cannot be more precise than checking that a
+value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+ labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+After internal block checks, the next higher level of checking is
+cross-referencing records between metadata structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+ of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+ - Does the reverse mapping index list only the appropriate owner as the
+ owner of each block?
+
+ - Are none of the blocks claimed as free space?
+
+ - If these aren't file data blocks, are none of the blocks claimed as space
+ shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+ - Everything in class 1 above.
+
+ - If there's a parent node block, do the keys listed for this block match the
+ keyspace of this block?
+
+ - Do the sibling pointers point to valid blocks? Of the same level?
+
+ - Do the child pointers point to valid blocks? Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Does the reverse mapping index list no owners of this space?
+
+ - Is this space not claimed by the inode index for inodes?
+
+ - Is it not mentioned by the reference count index?
+
+ - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+ - Do cleared bits in the holemask correspond with inode clusters?
+
+ - Do set bits in the freemask correspond with inode records with zero link
+ count?
+
+5. Inode records are cross-referenced as follows:
+
+ - Everything in class 1.
+
+ - Do all the fields that summarize information about the file forks actually
+ match those forks?
+
+ - Does each inode with zero link count correspond to a record in the free
+ inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Is this space not mentioned by the inode btrees?
+
+ - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+ reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Within the space subkeyspace of the rmap btree (that is to say, all
+ records mapped to a particular space extent and ignoring the owner info),
+ are there the same number of reverse mapping records for each block as the
+ reference count record claims?
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enable fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+ irregularities in the blocks or dabtree mappings that do not point to
+ attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+ For each entry inside a leaf:
+
+ a. Validate that the name does not contain invalid characters.
+
+ b. Read the attr value.
+ This performs a named lookup of the attr name to ensure the correctness
+ of the dabtree.
+ If the value is stored in a remote block, this also validates the
+ integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one parent directory pointing to it.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories point to it.
+
+In XFS, a directory is implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+ are no irregularities in the blocks or dabtree mappings that do not point to
+ dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+ Each dirent is checked as follows:
+
+ a. Does the name contain no invalid characters?
+
+ b. Does the inumber correspond to an actual, allocated inode?
+
+ c. Does the child inode have a nonzero link count?
+
+ d. If a file type is included in the dirent, does it match the type of the
+ inode?
+
+ e. If the child is a subdirectory, does the child's dotdot pointer point
+ back to the parent?
+
+ f. If the directory has a second partition, perform a named lookup of the
+ dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+ the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+ blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+ or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+ contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+ contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling system administrators to monitor the status
+of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a chain,
+online checking must coordinate with chained operations that are in progress to
+avoid incorrectly detecting inconsistencies due to pending chains.
+Furthermore, online repair must not run when operations are pending because
+the metadata are temporarily inconsistent with each other, and rebuilding is
+not possible.
+
+Only online fsck has this requirement of total consistency of AG metadata, and
+fsck runs should be relatively rare compared to filesystem change operations.
+Online fsck coordinates with transaction chains as follows:
+
+* For each AG, maintain a count of intent items targeting that AG.
+ The count should be bumped whenever a new item is added to the chain.
+ The count should be dropped when the filesystem has locked the AG header
+ buffers and finished the work.
+
+* When online fsck wants to examine an AG, it should lock the AG header
+ buffers to quiesce all transaction chains that want to modify that AG.
+ If the count is zero, proceed with the checking operation.
+ If it is nonzero, cycle the buffer locks to allow the chain to make forward
+ progress.
+
+This may lead to online fsck taking a long time to complete, but regular
+filesystem updates take precedence over background checking activity.
+Details about the discovery of this situation are presented in the
+:ref:`next section <chain_coordination>`, and details about the solution
+are presented :ref:`after that<intent_drains>`.
+
+.. _chain_coordination:
+
+Discovery of the Problem
+````````````````````````
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+ structures to remove the mapping from the btree blocks.
+ It then attaches to the in-memory transaction an action item to schedule
+ deferred freeing of space.
+ Concretely, each transaction maintains a list of ``struct
+ xfs_defer_pending`` objects, each of which maintains a list of ``struct
+ xfs_extent_free_item`` objects.
+ Returning to the example above, the action item tracks the freeing of both
+ the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+ AG 3.
+ Deferred frees recorded in this manner are committed in the log by creating
+ an EFI log item from the ``struct xfs_extent_free_item`` object and
+ attaching the log item to the transaction.
+ When the log is persisted to disk, the EFI item is written into the ondisk
+ transaction record.
+ EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+ of AG 3 to release the former BMBT block and a second physical update to the
+ free space btrees of AG 7 to release the unmapped file space.
+ Observe that the physical updates are resequenced in the correct order
+ when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+ The EFD contains a pointer to the EFI logged in transaction #1 so that log
+ recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are subtleties to XFS' transaction chaining strategy to consider:
+
+* Log items must be added to a transaction in the correct order to prevent
+ conflicts with principal objects that are not held by the transaction.
+ In other words, all per-AG metadata updates for an unmapped block must be
+ completed before the last update to free the extent, and extents should not
+ be reallocated until that last update commits to the log.
+
+* AG header buffers are released between each transaction in a chain.
+ This means that other threads can observe an AG in an intermediate state,
+ but as long as the first subtlety is handled, this should not affect the
+ correctness of filesystem operations.
+
+* Unmounting the filesystem flushes all pending work to disk, which means that
+ offline fsck never sees the temporary inconsistencies caused by deferred
+ work item processing.
+
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
+
+To deal with this explosion in a calm manner, XFS expands its use of deferred
+work items to cover most reverse mapping updates and all refcount updates.
+This reduces the worst case size of transaction reservations by breaking the
+work into a long chain of small updates, which increases the degree of eventual
+consistency in the system.
+Again, this generally isn't a problem because XFS orders its deferred work
+items carefully to avoid resource reuse conflicts between unsuspecting threads.
+
+However, online fsck changes the rules -- remember that although physical
+updates to per-AG structures are coordinated by locking the buffers for AG
+headers, buffer locks are dropped between transactions.
+Once scrub acquires resources and takes locks for a data structure, it must do
+all the validation work without releasing the lock.
+If the main lock for a space btree is an AG header buffer lock, scrub may have
+interrupted another thread that is midway through finishing a chain.
+For example, if a thread performing a copy-on-write has completed a reverse
+mapping update but not the corresponding refcount update, the two AG btrees
+will appear inconsistent to scrub and an observation of corruption will be
+recorded. This observation will not be correct.
+If a repair is attempted in this state, the results will be catastrophic!
+
+Several other solutions to this problem were evaluated upon discovery of this
+flaw and rejected:
+
+1. Add a higher level lock to allocation groups and require writer threads to
+ acquire the higher level lock in AG order before making any changes.
+ This would be very difficult to implement in practice because it is
+ difficult to determine which locks need to be obtained, and in what order,
+ without simulating the entire operation.
+ Performing a dry run of a file operation to discover necessary locks would
+ make the filesystem very slow.
+
+2. Make the deferred work coordinator code aware of consecutive intent items
+ targeting the same AG and have it hold the AG header buffers locked across
+ the transaction roll between updates.
+ This would introduce a lot of complexity into the coordinator since it is
+ only loosely coupled with the actual deferred work items.
+ It would also fail to solve the problem because deferred work items can
+ generate new deferred subtasks, but all subtasks must be complete before
+ work can start on a new sibling task.
+
+3. Teach online fsck to walk all transactions waiting for whichever lock(s)
+ protect the data structure being scrubbed to look for pending operations.
+ The checking and repair operations must factor these pending operations into
+ the evaluations being performed.
+ This solution is a nonstarter because it is *extremely* invasive to the main
+ filesystem.
+
+.. _intent_drains:
+
+Intent Drains
+`````````````
+
+Online fsck uses an atomic intent item counter and lock cycling to coordinate
+with transaction chains.
+There are two key properties to the drain mechanism.
+First, the counter is incremented when a deferred work item is *queued* to a
+transaction, and it is decremented after the associated intent done log item is
+*committed* to another transaction.
+The second property is that deferred work can be added to a transaction without
+holding an AG header lock, but per-AG work items cannot be marked done without
+locking that AG header buffer to log the physical updates and the intent done
+log item.
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+ transaction.
+
+2. The function calls ``xfs_defer_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+ calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+ ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
+ waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+ intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+ For example, a scan of the refcount btree would lock the AGI and AGF header
+ buffers.
+
+2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
+ chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
+ back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
+
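+A sketch of the scrub-side loop might look like this; the drain helpers are
+named as in this document, while the locking helpers and the wait function
+are illustrative assumptions rather than established interfaces.
+
+.. code-block:: c
+
+	STATIC int
+	xchk_drain_example(
+		struct xfs_scrub	*sc,
+		struct xfs_perag	*pag)
+	{
+		int			error;
+
+		for (;;) {
+			/* Step 1: lock the AGI and AGF header buffers. */
+			error = xchk_ag_lock(sc, pag);
+			if (error)
+				return error;
+
+			/* Step 2: no chains in progress, so proceed. */
+			if (!xfs_defer_drain_busy(&pag->pag_intents_drain))
+				return 0;
+
+			/* Step 3: back off so chains can make progress. */
+			xchk_ag_unlock(sc, pag);
+
+			/* Step 4: sleep until the intent count is zero. */
+			error = xfs_defer_drain_wait(&pag->pag_intents_drain);
+			if (error)
+				return error;	/* a signal was caught */
+		}
+	}
+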
+The proposed patchset is the
+`scrub intent drain series
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful not
+to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows:
+
+- The hooked part of XFS should declare a static-scoped static key that
+ defaults to false.
+ The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+ The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+ filesystem should call the ``static_branch_unlikely`` predicate to avoid the
+ scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+ ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+ static key.
+ Wrapper functions make it easy to compile out the relevant code if the kernel
+ distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+ the ``xchk_fsgates_enable`` from the setup function to enable a specific
+ hook.
+ This must be done before obtaining any resources that are used by memory
+ reclaim.
+ Callers had better be sure they really need the functionality gated by the
+ static key; the ``TRY_HARDER`` flag is useful here.
+
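+A minimal sketch of this pattern, using the generic static key API from
+``<linux/jump_label.h>``; the XFS-flavored names below are made up for
+illustration.
+
+.. code-block:: c
+
+	#include <linux/jump_label.h>
+
+	struct xfs_mount;
+	void xchk_hook_example_body(struct xfs_mount *mp);	/* hypothetical */
+
+	/* Defaults to false, so the hook compiles to a nop sled. */
+	static DEFINE_STATIC_KEY_FALSE(xchk_hook_gate);
+
+	/* Hot path: the hook body is skipped unless scrub enabled it. */
+	static inline void xfs_hook_example(struct xfs_mount *mp)
+	{
+		if (static_branch_unlikely(&xchk_hook_gate))
+			xchk_hook_example_body(mp);
+	}
+
+	/* Helpers exported to scrub to flip the branch direction. */
+	void xfs_hook_example_enable(void)
+	{
+		static_branch_inc(&xchk_hook_gate);
+	}
+
+	void xfs_hook_example_disable(void)
+	{
+		static_branch_dec(&xchk_hook_gate);
+	}
+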
+Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
+handle locking AGI and AGF buffers for all scrubber functions.
+If it detects a conflict between scrub and the running transactions, it will
+try to wait for intents to complete.
+If the caller of the helper has not enabled the static key, the helper will
+return -EDEADLOCK, which should result in the scrub being restarted with the
+``TRY_HARDER`` flag set.
+The scrub setup function should detect that flag, enable the static key, and
+try the scrub again.
+Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
+
+For more information, please see the kernel documentation of
+Documentation/staging/static-keys.rst.
+
+.. _xfile:
+
+Pageable Kernel Memory
+----------------------
+
+Some online checking functions work by scanning the filesystem to build a
+shadow copy of an ondisk metadata structure in memory and comparing the two
+copies.
+For online repair to rebuild a metadata structure, it must compute the record
+set that will be stored in the new structure before it can persist that new
+structure to disk.
+Ideally, repairs complete with a single atomic commit that introduces
+a new data structure.
+To meet these goals, the kernel needs to collect a large amount of information
+in a place that doesn't require the correct operation of the filesystem.
+
+Kernel memory isn't suitable because:
+
+* Allocating a contiguous region of memory to create a C array is very
+ difficult, especially on 32-bit systems.
+
+* Linked lists of records introduce double pointer overhead which is very high
+ and eliminate the possibility of indexed lookups.
+
+* Kernel memory is pinned, which can drive the system into OOM conditions.
+
+* The system might not have sufficient memory to stage all the information.
+
+At any given time, online fsck does not need to keep the entire record set in
+memory, which means that individual records can be paged out if necessary.
+Continued development of online fsck demonstrated that the ability to perform
+indexed data storage would also be very useful.
+Fortunately, the Linux kernel already has a facility for byte-addressable and
+pageable storage: tmpfs.
+In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| The first edition of online repair inserted records into a new btree as  |
+| it found them, which failed because the filesystem could shut down with  |
+| a half-rebuilt data structure, which would be live after recovery        |
+| finished.                                                                |
+| |
+| The second edition solved the half-rebuilt structure problem by storing |
+| everything in memory, but frequently ran the system out of memory. |
+| |
+| The third edition solved the OOM problem by using linked lists, but the |
+| memory overhead of the list pointers was extreme. |
++--------------------------------------------------------------------------+
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+ extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+ attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the xfile
+to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
+study.
+
+The most general storage interface supported by the xfile enables the reading
+and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
+This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
+which behave similarly to their userspace counterparts.
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
+functions are provided to read and persist objects into an xfile.
+They are internally the same as pread and pwrite, except that they treat any
+error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to userspace.
+All five xfile use cases can be serviced by these four functions.
+
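+For instance, a caller staging reverse mapping records might round-trip one
+record through an xfile as follows; the signatures shown are inferred from
+the description above and should be treated as illustrative.
+
+.. code-block:: c
+
+	struct xfs_rmap_irec	rec;
+	loff_t			pos = idx * sizeof(rec);
+	int			error;
+
+	/* Persist the record; any failure is reported as -ENOMEM. */
+	error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
+	if (error)
+		return error;
+
+	/* Recall the record from the same position. */
+	error = xfile_obj_load(xf, &rec, sizeof(rec), pos);
+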
+However, no discussion of file access idioms is complete without answering the
+question, "But what about mmap?"
+It is convenient to access storage directly with pointers, just like userspace
+code does with regular memory.
+Online fsck must not drive the system into OOM conditions, which means that
+xfiles must be responsive to memory reclamation.
+tmpfs can only push a pagecache folio to the swap cache if the folio is neither
+pinned nor locked, which means the xfile must not pin too many folios.
+
+Short term direct access to xfile contents is done by locking the pagecache
+folio and mapping it into kernel address space.
+Programmatic access (e.g. pread and pwrite) uses this mechanism.
+Folio locks are not supposed to be held for long periods of time, so long
+term direct access to xfile contents is done by bumping the folio refcount,
+mapping it into kernel address space, and dropping the folio lock.
+These long term users *must* be responsive to memory reclaim by hooking into
+the shrinker infrastructure to know when to release folios.
+
+The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+retrieve the (locked) folio that backs part of an xfile and to release it.
+The only users of these folio lease functions are the xfarray
+:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
+btrees<xfbtree>`.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security system,
+must never be mapped into process file descriptor tables, and their pages must
+never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs file
+are performed by manipulating the page cache directly.
+xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into the
+page, and release the pages.
+xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
+before copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
+tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
+
+If an xfile is shared between threads to stage repairs, the caller must provide
+its own locks to coordinate access.
+For example, if a scrub function stores scan results in an xfile and needs
+other threads to provide updates to the scanned data, the scrub function must
+provide a lock for all threads to share.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of fixed-size
+records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step and
+retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers to use a higher
+level abstraction that takes care of computing array offsets, provides
+iterator functions, and deals with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records atop
+the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
+In other words, these callers want a sparse linearly addressed table file.
+Typical use cases are quota records and file link count records.
+Access to array elements is performed programmatically via ``xfarray_load`` and
+``xfarray_store`` functions, which wrap the similarly-named xfile functions to
+provide loading and storing of array elements at arbitrary array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
+
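+As an illustration, a quota-like record keyed by an arbitrary index could be
+handled as follows; the record type and exact signatures are assumptions
+based on the description above.
+
+.. code-block:: c
+
+	struct xqcheck_dquot	dq = { };	/* hypothetical record type */
+	int			error;
+
+	/* Store a record at an arbitrary (possibly sparse) index. */
+	error = xfarray_store(array, id, &dq);
+	if (error)
+		return error;
+
+	/* Load it back; gaps read back as all-zeroes null records. */
+	error = xfarray_load(array, id, &dq);
+	if (!error && xfarray_element_is_null(array, &dq))
+		error = -ENODATA;	/* never written, or unset */
+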
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can arrange the records
+in sorted order; this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag; and the ``xfarray_unset`` function removes a
+record from the bag.
+
+The proposed patchset is the
+`big in-memory array
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored in
+the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+ xfarray_idx_t i;
+ foreach_xfarray_idx(array, i) {
+ xfarray_load(array, i, &rec);
+
+ /* do something with rec */
+ }
+
+All users of this idiom must be prepared to handle null records or must already
+know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
+of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+ xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+ while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+ /* do something with rec */
+ }
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer remarked
+that for performance reasons, online repair ought to load batches of records
+into btree record blocks instead of inserting records into a new btree one at a
+time.
+The btree insertion code in XFS is responsible for maintaining correct ordering
+of the records, so naturally the xfarray must also support sorting the record
+set prior to bulk loading.
+
+Case Study: Sorting xfarrays
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sorting algorithm used in the xfarray is actually a combination of adaptive
+quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
+`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also uses
+heapsort to hedge against performance collapse if the chosen quicksort pivots
+are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
+gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of heapsort.
+It only operates on regular C arrays, which limits the scope of its usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+ of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle value of each triad to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+.. _xfblob:
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
+
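+A short usage sketch follows; the argument order is inferred from the
+description and therefore illustrative.
+
+.. code-block:: c
+
+	xfblob_cookie	cookie;
+	int		error;
+
+	/* Persist a dirent name and receive a cookie for it. */
+	error = xfblob_store(blobs, &cookie, name, namelen);
+	if (error)
+		return error;
+
+	/* Much later, recall the name by presenting the cookie. */
+	error = xfblob_load(blobs, cookie, namebuf, namelen);
+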
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree is a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+ pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+ ``xfbtree_create`` to write an initial tree header and root block to the
+ xfile.
+ Each btree type should define a wrapper that passes necessary arguments to
+ the creation function.
+ For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+ all the necessary details for callers.
+ A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+ btree type.
+ Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+ for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+ and to update the in-memory btree.
+ For example, a btree cursor for an rmap xfbtree can be passed to the
+ ``xfs_rmap_*`` functions just like any other btree cursor.
+ See the :ref:`next section<xfbtree_commit>` for information on dealing with
+ xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free the
+   buffer target, and destroy the xfile to release all resources.
+   A sketch of the full sequence follows this list.
+
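+Strung together, the lifecycle might look like the sketch below; only the
+function names come from the steps above, so every signature here is an
+illustrative assumption.
+
+.. code-block:: c
+
+	struct xfs_buftarg	*btp;
+	struct xfbtree		*xfbt;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	/* Steps 1-2: create the xfile and its buffer cache target. */
+	error = xfs_alloc_memory_buftarg(mp, "rmap staging", &btp);
+	if (error)
+		return error;
+
+	/* Step 3: write the tree header and root block to the xfile. */
+	error = xfs_rmapbt_mem_create(mp, agno, btp, &xfbt);
+	if (error) {
+		xfs_free_buftarg(btp);
+		return error;
+	}
+
+	/* Step 4: get a cursor for the in-memory btree. */
+	cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);
+
+	/* Step 5: pass cur to the regular xfs_rmap_* functions here. */
+
+	/* Step 6: tear everything down again. */
+	xfs_btree_del_cursor(cur, 0);
+	xfbtree_destroy(xfbt);
+	xfs_free_buftarg(btp);
+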
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must
+remove the buffer log items from the transaction and write the updates into the
+backing xfile before committing or cancelling the transaction.
+
+The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
+this functionality as follows:
+
+1. Find each buffer log item whose buffer targets the xfile.
+
+2. Record the dirty/ordered status of the log item.
+
+3. Detach the log item from the buffer.
+
+4. Queue the buffer to a special delwri list.
+
+5. Clear the transaction dirty flag if the only dirty log items were the ones
+ that were detached in step 3.
+
+6. Submit the delwri list to commit the changes to the xfile, if the updates
+ are being committed.
+
+After removing xfile logged buffers from the transaction in this manner, the
+transaction can be committed or cancelled.
+
+Bulk Loading of Ondisk B+Trees
+------------------------------
+
+As mentioned previously, early iterations of online repair built new btree
+structures by creating a new btree and adding observations individually.
+Loading a btree one record at a time had a slight advantage of not requiring
+the incore records to be sorted prior to commit, but was very slow and leaked
+blocks if the system went down during a repair.
+Loading records one at a time also meant that repair could not control the
+loading factor of the blocks in the new btree.
+
+Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
+rebuilding a btree index from a collection of records -- bulk btree loading.
+This was implemented rather inefficiently code-wise, since ``xfs_repair``
+had separate copy-pasted implementations for each btree type.
+
+To prepare for online fsck, each of the four bulk loaders were studied, notes
+were taken, and the four were refactored into a single generic btree bulk
+loading mechanism.
+Those notes in turn have been refreshed and are presented below.
+
+Geometry Computation
+````````````````````
+
+The zeroth step of bulk loading is to assemble the entire record set that will
+be stored in the new btree, and sort the records.
+Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
+btree from the record set, the type of btree, and any load factor preferences.
+This information is required for resource reservation.
+
+First, the geometry computation computes the minimum and maximum records that
+will fit in a leaf block from the size of a btree block and the size of the
+block header.
+Roughly speaking, the maximum number of records is::
+
+ maxrecs = (block_size - header_size) / record_size
+
+The XFS design specifies that btree blocks should be merged when possible,
+which means the minimum number of records is half of maxrecs::
+
+ minrecs = maxrecs / 2
+
+The next variable to determine is the desired loading factor.
+This must be at least minrecs and no more than maxrecs.
+Choosing minrecs is undesirable because it wastes half the block.
+Choosing maxrecs is also undesirable because adding a single record to each
+newly rebuilt leaf block will cause a tree split, which causes a noticeable
+drop in performance immediately afterwards.
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties::
+
+ default_load_factor = (maxrecs + minrecs) / 2
+
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+ leaf_load_factor = enough space ? default_load_factor : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+ maxrecs = (block_size - header_size) / (key_size + ptr_size)
+ minrecs = maxrecs / 2
+ node_load_factor = enough space ? default_load_factor : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+ leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+ n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+ node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+ tree is ``level + 1`` and the space needed is the summation of the number of
+ blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+ inode fork area, the height is ``level + 2``, the space needed is the
+ summation of the number of blocks on each level, and the inode fork points to
+ the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+ the inode fork area, then the root block can be stored in the inode, the
+ height is ``level + 1``, and the space needed is one less than the summation
+ of the number of blocks on each level.
+ This only becomes relevant when non-bmap btrees gain the ability to root in
+ an inode, which is a future patchset and only included here for completeness.
+
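+To make the arithmetic concrete, consider a worked example with purely
+illustrative numbers: 4096-byte blocks, a 56-byte block header, 16-byte
+records, 4-byte keys and pointers, no space pressure, and 10,000 records to
+load::
+
+	maxrecs             = (4096 - 56) / 16 = 252
+	minrecs             = 252 / 2 = 126
+	default_load_factor = (252 + 126) / 2 = 189
+	leaf_blocks         = ceil(10000 / 189) = 53
+
+	node maxrecs        = (4096 - 56) / (4 + 4) = 505
+	node_load_factor    = (505 + 252) / 2 = 378
+	node_blocks[1]      = ceil(53 / 378) = 1
+
+Level 1 needs only one block, so an AG-rooted btree of this shape would be
+two levels high and consume 54 blocks.
+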
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and a new EFI for the same extents.
+This enables the log to release the old EFI and keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping<reaping>` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
+and the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks::
+
+ ┌────┐
+ │leaf│
+ │RRR │
+ └────┘
+
+Sibling pointers are set every time a new block is added to the level::
+
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node::
+
+ ┌────┐ ┌────┐
+ │node│──────→│node│
+ │PP │←──────│PP │
+ └────┘ └────┘
+ ↙ ↘ ↙ ↘
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+When it reaches the root level, it is ready to commit the new btree!::
+
+ ┌─────────┐
+ │ root │
+ │ PP │
+ └─────────┘
+ ↙ ↘
+ ┌────┐ ┌────┐
+ │node│──────→│node│
+ │PP │←──────│PP │
+ └────┘ └────┘
+ ↙ ↘ ↙ ↘
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL list before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+ a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+ by the btree builder. The new EFDs must point to the EFIs attached to
+ the reservation to prevent log recovery from freeing the new blocks.
+
+ b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+ transaction chain.
+
+ c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+ reservation of the committing transaction.
+ If the btree loading code suspects this might be about to happen, it must
+ call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+ fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+ the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, which each can accommodate many thousands of block freeing
+instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+ records from the inode chunk information and a bitmap of the old inode btree
+ blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the inode btree.
+ If the free space inode btree is enabled, call it again to estimate the
+ geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+ If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the file is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate
+the inode btree with all inode chunk records, and a second time to populate the
+free inode btree with records for chunks that have free non-sparse inodes.
+The number of records for the inode btree is the number of xfarray records,
+but the record count for the free inode btree has to be computed as inode chunk
+records are stored in the xfarray.
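+
+A sketch of this counting, using simplified one-bit-per-inode masks rather
+than the real ondisk inode chunk record layout, might look like this::
+
+        struct ichunk {
+                uint64_t        startino;       /* first of 64 inumbers */
+                uint64_t        free;           /* one bit per free inode */
+                uint64_t        hole;           /* one bit per sparse hole */
+        };
+
+        /* Free inodes that are not sparse holes warrant a finobt record. */
+        static inline bool ichunk_has_free(const struct ichunk *ic)
+        {
+                return (ic->free & ~ic->hole) != 0;
+        }
+
+        /* While appending chunk records, count future finobt records. */
+        if (ichunk_has_free(&ic))
+                finobt_recs++;
+        error = xfarray_append(array, &ic);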
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the Space Reference Counts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reverse mapping records are used to rebuild the reference count information.
+Reference counts are required for correct operation of copy on write for shared
+file data.
+Imagine the reverse mapping entries as rectangles representing extents of
+physical blocks, and that the rectangles can be laid down to allow them to
+overlap each other.
+From the diagram below, it is apparent that a reference count record must start
+or end wherever the height of the stack changes.
+In other words, the record emission stimulus is level-triggered::
+
+ █ ███
+ ██ █████ ████ ███ ██████
+ ██ ████ ███████████ ████ █████████
+ ████████████████████████████████ ███████████
+ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^
+ 2 1 23 21 3 43 234 2123 1 01 2 3 0
+
+The ondisk reference count btree does not store the refcount == 0 cases because
+the free space btree already records which blocks are free.
+Extents being used to stage copy-on-write operations should be the only records
+with refcount == 1.
+Single-owner file blocks aren't recorded in either the free space or the
+reference count btrees.
+
+The high level process to rebuild the reference count btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
+ records for any space having more than one reverse mapping and add them to
+ the xfarray.
+ Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
+ because these are extents allocated to stage a copy on write operation and
+ are tracked in the refcount btree.
+
+ Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
+ refcount btree blocks.
+
+2. Sort the records in physical extent order, putting the CoW staging extents
+ at the end of the xfarray.
+ This matches the sorting order of records in the refcount btree.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the new tree.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+
+6. Commit the location of the new btree root block to the AGF.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows; the same algorithm is used by ``xfs_repair`` to
+generate refcount information from reverse mapping records.
+
+- Until the reverse mapping btree runs out of records:
+
+ - Retrieve the next record from the btree and put it in a bag.
+
+ - Collect all records with the same starting block from the btree and put
+ them in the bag.
+
+ - While the bag isn't empty:
+
+ - Among the mappings in the bag, compute the lowest block number where the
+ reference count changes.
+ This position will be either the starting block number of the next
+ unprocessed reverse mapping or the next block after the shortest mapping
+ in the bag.
+
+ - Remove all mappings from the bag that end at this position.
+
+ - Collect all reverse mappings that start at this position from the btree
+ and put them in the bag.
+
+    - If the size of the bag changed and is greater than one, create a new
+      refcount record associating the block number range that we just walked
+      with the size of the bag.
+
+The bag-like structure in this case is a type 2 xfarray as discussed in the
+:ref:`xfarray access patterns<xfarray_access_patterns>` section.
+Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
+removed via ``xfarray_unset``.
+Bag members are examined through ``xfarray_iter`` loops.
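+
+The loop above can be modeled in standalone C with a plain array standing in
+for the bag.
+Note that this sketch emits one record per change-point interval; the "size
+of the bag changed" test in the real algorithm additionally merges adjacent
+intervals that have the same count::
+
+        #include <stdio.h>
+        #include <stddef.h>
+
+        struct rmap {
+                unsigned long long      start;  /* first mapped block */
+                unsigned long long      len;    /* length of the mapping */
+        };
+
+        /* Emit one refcount record covering blocks [lo, hi). */
+        static void emit(unsigned long long lo, unsigned long long hi,
+                         size_t refcount)
+        {
+                printf("start %llu len %llu refcount %zu\n",
+                       lo, hi - lo, refcount);
+        }
+
+        /* "r" must be sorted by starting block; bag capacity is fixed here. */
+        static void rmaps_to_refcounts(const struct rmap *r, size_t nr)
+        {
+                unsigned long long bag[64];     /* ends of active mappings */
+                unsigned long long pos = 0, next;
+                size_t bag_nr = 0, i = 0, j;
+
+                while (i < nr || bag_nr > 0) {
+                        if (bag_nr == 0)        /* jump across the gap */
+                                pos = r[i].start;
+
+                        /* Pull in all mappings that start at this position. */
+                        while (i < nr && r[i].start == pos) {
+                                bag[bag_nr++] = r[i].start + r[i].len;
+                                i++;
+                        }
+
+                        /* Lowest position where the refcount changes. */
+                        next = bag[0];
+                        for (j = 1; j < bag_nr; j++)
+                                if (bag[j] < next)
+                                        next = bag[j];
+                        if (i < nr && r[i].start < next)
+                                next = r[i].start;
+
+                        /* refcount == 1 extents are not stored ondisk. */
+                        if (bag_nr > 1)
+                                emit(pos, next, bag_nr);
+
+                        /* Remove all mappings that end at this position. */
+                        for (j = 0; j < bag_nr; )
+                                if (bag[j] == next)
+                                        bag[j] = bag[--bag_nr];
+                                else
+                                        j++;
+
+                        pos = next;
+                }
+        }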
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding File Fork Mapping Indices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild a data/attr fork mapping btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
+ records from the reverse mapping records for that inode and fork.
+ Append these records to an xfarray.
+ Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
+ records.
+
+2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the new tree.
+
+3. Sort the records in file offset order.
+
+4. If the extent records would fit in the inode fork immediate area, commit the
+ records to that immediate area and skip to step 8.
+
+5. Allocate the number of blocks computed in step 2.
+
+6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+
+7. Commit the new btree root block to the inode fork immediate area.
+
+8. Reap the old btree blocks using the bitmap created in step 1.
+
+There are some complications here:
+First, it's possible to move the fork offset to adjust the sizes of the
+immediate areas if the data and attr forks are not both in BMBT format.
+Second, if there are sufficiently few fork mappings, it may be possible to use
+EXTENTS format instead of BMBT, which may require a conversion.
+Third, the incore extent map must be reloaded carefully to avoid disturbing
+any delayed allocation extents.
+
+The proposed patchset is the
+`file mapping repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
+series.
+
+.. _reaping:
+
+Reaping Old Metadata Blocks
+---------------------------
+
+Whenever online fsck builds a new data structure to replace one that is
+suspect, there is a question of how to find and dispose of the blocks that
+belonged to the old structure.
+The laziest method of course is not to deal with them at all, but this slowly
+leads to service degradations as space leaks out of the filesystem.
+Hopefully, someone will schedule a rebuild of the free space information to
+plug all those leaks.
+Offline repair rebuilds all space metadata after recording the usage of
+the files and directories that it decides not to clear, hence it can build new
+structures in the discovered free space and avoid the question of reaping.
+
+As part of a repair, online fsck relies heavily on the reverse mapping records
+to find space that is owned by the corresponding rmap owner yet truly free.
+Cross referencing rmap records with other rmap records is necessary because
+there may be other data structures that also think they own some of those
+blocks (e.g. crosslinked trees).
+Permitting the block allocator to hand them out again will not push the system
+towards consistency.
+
+For space metadata, the process of finding extents to dispose of generally
+follows this format:
+
+1. Create a bitmap of space used by data structures that must be preserved.
+ The space reservations used to create the new metadata can be used here if
+ the same rmap owner code is used to denote all of the objects being rebuilt.
+
+2. Survey the reverse mapping data to create a bitmap of space owned by the
+ same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
+
+3. Use the bitmap disunion operator to subtract (1) from (2).
+ The remaining set bits represent candidate extents that could be freed.
+ The process moves on to step 4 below.
+
+Repairs for file-based metadata such as extended attributes, directories,
+symbolic links, quota files and realtime bitmaps are performed by building a
+new structure attached to a temporary file and swapping the forks.
+Afterward, the mappings in the old file fork are the candidate blocks for
+disposal.
+
+The process for disposing of old extents is as follows:
+
+4. For each candidate extent, count the number of reverse mapping records for
+ the first block in that extent that do not have the same rmap owner for the
+ data structure being repaired.
+
+ - If zero, the block has a single owner and can be freed.
+
+ - If not, the block is part of a crosslinked structure and must not be
+ freed.
+
+5. Starting with the next block in the extent, figure out how many more blocks
+ have the same zero/nonzero other owner status as that first block.
+
+6. If the region is crosslinked, delete the reverse mapping entry for the
+ structure being repaired and move on to the next region.
+
+7. If the region is to be freed, mark any corresponding buffers in the buffer
+ cache as stale to prevent log writeback.
+
+8. Free the region and move on.
+
+However, there is one complication to this procedure.
+Transactions are of finite size, so the reaping process must be careful to roll
+the transactions to avoid overruns.
+Overruns come from two sources:
+
+a. EFIs logged on behalf of space that is no longer occupied
+
+b. Log items for buffer invalidations
+
+This is also a window in which a crash during the reaping process can leak
+blocks.
+As stated earlier, online repair functions use very large transactions to
+minimize the chances of this occurring.
+
+The proposed patchset is the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
+series.
+
+Case Study: Reaping After a Regular Btree Repair
+````````````````````````````````````````````````
+
+Old reference count and inode btrees are the easiest to reap because they have
+rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
+btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
+Creating a list of extents to reap the old btree blocks is quite simple,
+conceptually:
+
+1. Lock the relevant AGI/AGF header buffers to prevent allocations and frees.
+
+2. For each reverse mapping record with an rmap owner corresponding to the
+ metadata structure being rebuilt, set the corresponding range in a bitmap.
+
+3. Walk the current data structures that have the same rmap owner.
+ For each block visited, clear that range in the above bitmap.
+
+4. Each set bit in the bitmap represents a block that could be a block from the
+ old data structures and hence is a candidate for reaping.
+ In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
+ are the blocks that might be freeable.
+
+If it is possible to maintain the AGF lock throughout the repair (which is the
+common case), then step 2 can be performed at the same time as the reverse
+mapping record walk that creates the records for the new btree.
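+
+Expressed over plain word bitmaps (the kernel actually uses an extent-based
+bitmap type), the candidate computation in step 4 is a single disunion pass::
+
+        /* candidates = rmap_records_owned_by & ~blocks_reachable_by_walk */
+        for (i = 0; i < nwords; i++)
+                candidates[i] = owned_by_rmap[i] & ~reachable_by_walk[i];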
+
+Case Study: Rebuilding the Free Space Indices
+`````````````````````````````````````````````
+
+The high level process to rebuild the free space indices is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
+ records from the gaps in the reverse mapping btree.
+
+2. Append the records to an xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+ space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks for the free space by length index.
+ Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+ reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+are described in the case study of
+:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study<rmap_repair>`.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+ owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+ other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any leftover blocks.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilized while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze, though not all of the pieces are
+necessary:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+ prevent other threads from thawing the filesystem, or other scrub threads
+ from initiating another fscounters freeze.
+
+- The fscounters freeze does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| The initial implementation used the actual VFS filesystem freeze |
+| mechanism to quiesce filesystem activity. |
+| With the filesystem frozen, it is possible to resolve the counter values |
+| with exact precision, but there are many problems with calling the VFS |
+| methods directly: |
+| |
+| - Other programs can unfreeze the filesystem without our knowledge. |
+| This leads to incorrect scan results and incorrect repairs. |
+| |
+| - Adding an extra lock to prevent others from thawing the filesystem |
+| required the addition of a ``->freeze_super`` function to wrap |
+| ``freeze_fs()``. |
+| This in turn caused other subtle problems because it turns out that |
+| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the |
+| last reference to the VFS superblock, and any subsequent access |
+| becomes a UAF bug! |
+| This can happen if the filesystem is unmounted while the underlying |
+| block device has frozen the filesystem. |
+| This problem could be solved by grabbing extra references to the |
+| superblock, but it felt suboptimal given the other inadequacies of |
+| this approach. |
+| |
+| - The log need not be quiesced to check the summary counters, but a VFS |
+| freeze initiates one anyway. |
+| This adds unnecessary runtime to live fscounter fsck operations. |
+| |
+| - Quiescing the log means that XFS flushes the (possibly incorrect) |
+| counters to disk as part of cleaning the log. |
+| |
+| - A bug in the VFS meant that freeze could complete even when |
+| sync_filesystem fails to flush the filesystem and returns an error. |
+| This bug was fixed in Linux 5.17. |
++--------------------------------------------------------------------------+
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.
+
+Full Filesystem Scans
+---------------------
+
+Certain types of metadata can only be checked by walking every file in the
+entire filesystem to record observations and comparing the observations against
+what's recorded on disk.
+Like every other type of online repair, repairs are made by writing those
+observations to disk in a replacement structure and committing it atomically.
+However, it is not practical to shut down the entire filesystem to examine
+hundreds of billions of files because the downtime would be excessive.
+Therefore, online fsck must build the infrastructure to manage a live scan of
+all the files in the filesystem.
+There are two questions that need to be solved to perform a live walk:
+
+- How does scrub manage the scan while it is collecting data?
+
+- How does the scan keep abreast of changes being made to the system by other
+ threads?
+
+.. _iscan:
+
+Coordinated Inode Scans
+```````````````````````
+
+In the original Unix filesystems of the 1970s, each directory entry contained
+an index number (*inumber*) which was used as an index into an ondisk array
+(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
+its data block mapping.
+This system is described by J. Lions, `"inode (5659)"
+<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
+UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
+Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
+`"Implementation of the File System"
+<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
+Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
+1913-4.
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section of the filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+ inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+ advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+ was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+ a. Move the examination cursor to the point of the inumber keyspace that
+ corresponds to the start of the next AG.
+
+ b. Adjust the visited inode cursor to indicate that it has "visited" the
+ last possible inode in the current AG's inode keyspace.
+ XFS inumbers are segmented, so the cursor needs to be marked as having
+ visited the entire keyspace up to just before the start of the next AG's
+ inode keyspace.
+
+ c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+ filesystem.
+
+ d. If there are no more AGs to examine, set both cursors to the end of the
+ inumber keyspace.
+ The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+ a. Move the examination cursor ahead to the next inode marked as allocated
+ by the inode btree.
+
+ b. Adjust the visited inode cursor to point to the inode just prior to where
+ the examination cursor is now.
+ Because the scanner holds the AGI buffer lock, no inodes could have been
+ created in the part of the inode keyspace that the visited inode cursor
+ just advanced.
+
+5. Get the incore inode for the inumber of the examination cursor.
+ By maintaining the AGI buffer lock until this point, the scanner knows that
+ it was safe to advance the examination cursor across the entire keyspace,
+ and that it has stabilized this next inode so that it cannot disappear from
+ the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+ If one is provided:
+
+ a. Lock the inode to prevent updates during the scan.
+
+ b. Scan the inode.
+
+ c. While still holding the inode lock, adjust the visited inode cursor
+ (``xchk_iscan_mark_visited``) to point to this inode.
+
+ d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_teardown`` to complete the scan.
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+First, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
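+
+A scan loop built from these pieces has roughly the following shape.
+The ``xchk_iscan_*`` calls are real, though their prototypes are simplified
+here, and ``xchk_example_scan_inode`` is a stand-in for the actual scrubber::
+
+        struct xchk_iscan       iscan;
+        struct xfs_inode        *ip;
+        int                     error = 0;
+
+        xchk_iscan_start(sc, &iscan);
+
+        /* Assume the iterator returns 1 each time it produces an inode. */
+        while (xchk_iscan_iter(&iscan, &ip) == 1) {
+                xfs_ilock(ip, XFS_ILOCK_EXCL);
+                error = xchk_example_scan_inode(sc, ip);
+                if (!error)
+                        xchk_iscan_mark_visited(&iscan, ip);
+                xfs_iunlock(ip, XFS_ILOCK_EXCL);
+                xchk_irele(sc, ip);
+                if (error)
+                        break;
+        }
+
+        xchk_iscan_teardown(&iscan);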
+
+The proposed patches are the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+The first user of the new functionality is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as
+part of file creation must be performed in transaction context because the
+filesystem must ensure the atomicity of the ondisk inode btree index updates
+and the initialization of the actual ondisk inode.
+
+References to incore inodes are always released (``xfs_irele``) outside of
+transaction context because there are a handful of activities that might
+require ondisk updates:
+
+- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
+ release.
+
+- Speculative preallocations need to be unreserved.
+
+- An unlinked file may have lost its last reference, in which case the entire
+ file must be inactivated, which involves releasing all of its resources in
+ the ondisk metadata and freeing the inode.
+
+These activities are collectively called inode inactivation.
+Inactivation has two parts -- the VFS part, which initiates writeback on all
+dirty file pages, and the XFS part, which cleans up XFS-specific information
+and frees the inode if it was unlinked.
+If the inode is unlinked (or unconnected after a file handle operation), the
+kernel drops the inode into the inactivation machinery immediately.
+
+During normal operation, resource acquisition for an update follows this order
+to avoid deadlocks:
+
+1. Inode reference (``iget``).
+
+2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
+
+3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
+
+4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
+ can update page cache mappings.
+
+5. Log feature enablement.
+
+6. Transaction log space grant.
+
+7. Space on the data and realtime devices for the transaction.
+
+8. Incore dquot references, if a file is being repaired.
+ Note that they are not locked, merely acquired.
+
+9. Inode ``ILOCK`` for file metadata updates.
+
+10. AG header buffer locks / Realtime metadata inode ILOCK.
+
+11. Realtime metadata buffer locks, if applicable.
+
+12. Extent mapping btree blocks, if applicable.
+
+Resources are often released in the reverse order, though this is not required.
+However, online fsck differs from regular XFS operations because it may examine
+an object that normally is acquired in a later stage of the locking order, and
+then decide to cross-reference the object with an object that is acquired
+earlier in the order.
+The next few sections detail the specific ways in which online fsck takes care
+to avoid deadlocks.
+
+iget and irele During a Scrub
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An inode scan performed on behalf of a scrub operation runs in transaction
+context, and possibly with resources already locked and bound to it.
+This isn't much of a problem for ``iget`` since it can operate in the context
+of an existing transaction, as long as all of the bound resources are acquired
+before the inode reference in the regular filesystem.
+
+When the VFS ``iput`` function is given a linked inode with no other
+references, it normally puts the inode on an LRU list in the hope that it can
+save time if another process re-opens the file before the system runs out
+of memory and frees it.
+Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
+flag on the inode to cause the kernel to try to drop the inode into the
+inactivation machinery immediately.
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
+`dir iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
+
+.. _ilocking:
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+in numerical order of the addresses of their ``struct inode`` object otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in numerical order of
+the addresses of their ``struct address_space`` objects.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions and avoid becoming trapped in an
+ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
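+
+The shape of such a trylock loop is sketched below; error paths and the
+post-relock re-checks are elided, and ``sc`` is assumed to be the scrub
+context::
+
+        /* ip1 is held; take ip2 without risking an ABBA deadlock. */
+        while (!xfs_ilock_nowait(ip2, XFS_ILOCK_EXCL)) {
+                xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+
+                /* Give up if a fatal signal is pending. */
+                if (xchk_should_terminate(sc, &error))
+                        return error;
+                delay(1);
+
+                xfs_ilock(ip1, XFS_ILOCK_EXCL);
+                /* ...re-check ip1 before trying ip2 again... */
+        }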
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _fshooks:
+
+Filesystem Hooks
+`````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact to highly threaded workloads.
+Regular blocking notifier chains use a rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, it may turn out that blocking chains combined with static keys are
+the more performant combination; more study is needed here.
+
+The following pieces are necessary to hook a certain point in the filesystem:
+
+- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
+ a well-known incore filesystem object.
+
+- Each hook must define an action code and a structure containing more context
+ about the action.
+
+- Hook providers should provide appropriate wrapper functions and structs
+ around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
+ checking to ensure correct usage.
+
+- A callsite in the regular filesystem code must be chosen to call
+ ``xfs_hooks_call`` with the action code and data structure.
+ This place should be adjacent to (and not earlier than) the place where
+ the filesystem update is committed to the transaction.
+ In general, when the filesystem calls a hook chain, it should be able to
+ handle sleeping and should not be vulnerable to memory reclaim or locking
+ recursion.
+ However, the exact requirements are very dependent on the context of the hook
+ caller and the callee.
+
+- The online fsck function should define a structure to hold scan data, a lock
+ to coordinate access to the scan data, and a ``struct xfs_hook`` object.
+ The scanner function and the regular filesystem code must acquire resources
+ in the same order; see the next section for details.
+
+- The online fsck code must contain a C function to catch the hook action code
+ and data structure.
+ If the object being updated has already been visited by the scan, then the
+ hook information must be applied to the scan data.
+
+- Prior to unlocking inodes to start the scan, online fsck must call
+ ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
+ ``xfs_hooks_add`` to enable the hook.
+
+- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
+ complete.
+
+The number of hooks should be kept to a minimum to reduce complexity.
+Static keys are used to reduce the overhead of filesystem hooks to nearly
+zero when online fsck is not running.
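+
+The sketch below shows how these pieces might fit together for a hypothetical
+"foo updated" hook; the ``xfs_foo_*`` names are made up, and only the
+``xfs_hooks`` machinery itself is real (with simplified signatures)::
+
+        /* Action codes and context payload for the hypothetical hook. */
+        enum xfs_foo_action {
+                XFS_FOO_UPDATED,
+        };
+
+        struct xfs_foo_update_params {
+                xfs_ino_t       ino;    /* inode being updated */
+                /* ...whatever else the checker needs... */
+        };
+
+        /* Type-checked wrapper embedded in a well-known incore object. */
+        struct xfs_foo_hooks {
+                struct xfs_hooks        hooks;
+        };
+
+        /* Called adjacent to where the update is committed. */
+        static inline void
+        xfs_foo_update_hook(
+                struct xfs_foo_hooks            *fhooks,
+                struct xfs_foo_update_params    *params)
+        {
+                xfs_hooks_call(&fhooks->hooks, XFS_FOO_UPDATED, params);
+        }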
+
+.. _liveupdate:
+
+Live Updates During a Scan
+``````````````````````````
+
+The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
+filesystem code look like this::
+
+ other program
+ ↓
+ inode lock ←────────────────────┐
+ ↓ │
+ AG header lock │
+ ↓ │
+ filesystem function │
+ ↓ │
+ notifier call chain │ same
+ ↓ ├─── inode
+ scrub hook function │ lock
+ ↓ │
+ scan data mutex ←──┐ same │
+ ↓ ├─── scan │
+ update scan data │ lock │
+ ↑ │ │
+ scan data mutex ←──┘ │
+ ↑ │
+ inode lock ←────────────────────┘
+ ↑
+ scrub function
+ ↑
+ inode scanner
+ ↑
+ xfs_scrub
+
+These rules must be followed to ensure correct interactions between the
+checking code and the code making an update to the filesystem:
+
+- Prior to invoking the notifier call chain, the filesystem function being
+ hooked must acquire the same lock that the scrub scanning function acquires
+ to scan the inode.
+
+- The scanning function and the scrub hook function must coordinate access to
+ the scan data by acquiring a lock on the scan data.
+
+- Scrub hook functions must not add the live update information to the scan
+  observations unless the inode being updated has already been scanned.
+ The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
+ for this.
+
+- Scrub hook functions must not change the caller's state, including the
+ transaction that it is running.
+ They must not acquire any resources that might conflict with the filesystem
+ function being hooked.
+
+- The hook function can abort the inode scan to avoid breaking the other rules.
+
+The inode scan APIs are pretty simple:
+
+- ``xchk_iscan_start`` starts a scan
+
+- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
+ returns zero if there is nothing left to scan
+
+- ``xchk_iscan_want_live_update`` decides if an inode has already been
+  visited in the scan.
+  This is critical for hook functions deciding whether they need to update
+  the in-memory scan information.
+
+- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
+  scan
+
+- ``xchk_iscan_teardown`` finishes the scan
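+
+Putting the rules and the APIs together, a hook function might look like the
+following sketch, reusing the hypothetical ``xfs_foo`` hook from the previous
+section; the shadow data layout and notifier plumbing are illustrative only::
+
+        struct xchk_example {
+                struct xchk_iscan       iscan;  /* coordinated scan cursor */
+                struct mutex            lock;   /* protects shadow data */
+                struct xfs_hook         hook;
+                /* ...shadow observations... */
+        };
+
+        static int
+        xchk_example_hook(
+                struct notifier_block   *nb,
+                unsigned long           action,
+                void                    *data)
+        {
+                struct xfs_foo_update_params    *p = data;
+                struct xchk_example             *ex;
+
+                ex = container_of(nb, struct xchk_example, hook.nb);
+
+                /* Ignore objects that the scan has not yet visited. */
+                if (!xchk_iscan_want_live_update(&ex->iscan, p->ino))
+                        return NOTIFY_DONE;
+
+                mutex_lock(&ex->lock);
+                /* ...fold the update into the shadow observations... */
+                mutex_unlock(&ex->lock);
+                return NOTIFY_DONE;
+        }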
+
+This functionality is also a part of the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+
+.. _quotacheck:
+
+Case Study: Quota Counter Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is useful to compare the mount time quotacheck code to the online repair
+quotacheck code.
+Mount time quotacheck does not have to contend with concurrent operations, so
+it does the following:
+
+1. Make sure the ondisk dquots are in good enough shape that all the incore
+ dquots will actually load, and zero the resource usage counters in the
+ ondisk buffer.
+
+2. Walk every inode in the filesystem.
+ Add each file's resource usage to the incore dquot.
+
+3. Walk each incore dquot.
+ If the incore dquot is not being flushed, add the ondisk buffer backing the
+ incore dquot to a delayed write (delwri) list.
+
+4. Write the buffer list to disk.
+
+Like most online fsck functions, online quotacheck can't write to regular
+filesystem objects until the newly collected metadata reflect all filesystem
+state.
+Therefore, online quotacheck records file resource usage to a shadow dquot
+index implemented with a sparse ``xfarray``, and only writes to the real dquots
+once the scan is complete.
+Handling transactional updates is tricky because quota resource usage updates
+are handled in phases to minimize contention on dquots:
+
+1. The inodes involved are joined and locked to a transaction.
+
+2. For each dquot attached to the file:
+
+ a. The dquot is locked.
+
+ b. A quota reservation is added to the dquot's resource usage.
+ The reservation is recorded in the transaction.
+
+ c. The dquot is unlocked.
+
+3. Changes in actual quota usage are tracked in the transaction.
+
+4. At transaction commit time, each dquot is examined again:
+
+ a. The dquot is locked again.
+
+ b. Quota usage changes are logged and unused reservation is given back to
+ the dquot.
+
+ c. The dquot is unlocked.
+
+For online quotacheck, hooks are placed in steps 2 and 4.
+The step 2 hook creates a shadow version of the transaction dquot context
+(``dqtrx``) that operates in a similar manner to the regular code.
+The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
+Notice that both hooks are called with the inode locked, which is how the
+live update coordinates with the inode scanner.
+
+The quotacheck scan looks like this:
+
+1. Set up a coordinated inode scan.
+
+2. For each inode returned by the inode scan iterator:
+
+ a. Grab and lock the inode.
+
+ b. Determine that inode's resource usage (data blocks, inode counts,
+ realtime blocks) and add that to the shadow dquots for the user, group,
+ and project ids associated with the inode.
+
+ c. Unlock and release the inode.
+
+3. For each dquot in the system:
+
+ a. Grab and lock the dquot.
+
+ b. Check the dquot against the shadow dquots created by the scan and updated
+ by the live hooks.
+
+Live updates are key to being able to walk every quota record without
+needing to hold any locks for a long duration.
+If repairs are desired, the real and shadow dquots are locked and their
+resource counts are set to the values in the shadow dquot.
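+
+The shadow accounting in step 2b of the scan has roughly this shape; the
+structure layout and helper shown here are illustrative, not the kernel's
+actual shadow dquot definitions::
+
+        struct xqcheck_dquot {
+                uint64_t        icount;         /* inodes charged to this id */
+                uint64_t        bcount;         /* data device blocks */
+                uint64_t        rtbcount;       /* realtime device blocks */
+        };
+
+        /* Charge one file's resource usage to a shadow dquot. */
+        static void
+        xqcheck_charge(
+                struct xqcheck_dquot    *sdq,
+                uint64_t                blocks,
+                uint64_t                rtblocks)
+        {
+                sdq->icount++;
+                sdq->bcount += blocks;
+                sdq->rtbcount += rtblocks;
+        }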
+
+The proposed patchset is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+.. _nlinks:
+
+Case Study: File Link Count Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+File link count checking also uses live update hooks.
+The coordinated inode scanner is used to visit all directories on the
+filesystem, and per-file link count records are stored in a sparse ``xfarray``
+indexed by inumber.
+During the scanning phase, each entry in a directory generates observation
+data as follows:
+
+1. If the entry is a dotdot (``'..'``) entry of the root directory, the
+ directory's parent link count is bumped because the root directory's dotdot
+ entry is self referential.
+
+2. If the entry is a dotdot entry of a subdirectory, the parent's backref
+ count is bumped.
+
+3. If the entry is neither a dot nor a dotdot entry, the target file's parent
+ count is bumped.
+
+4. If the target is a subdirectory, the parent's child link count is bumped.
+
+A crucial point to understand about how the link count inode scanner interacts
+with the live update hooks is that the scan cursor tracks which *parent*
+directories have been scanned.
+In other words, the live updates ignore any update about ``A → B`` when A has
+not been scanned, even if B has been scanned.
+Furthermore, a subdirectory A with a dotdot entry pointing back to B is
+accounted as a backref counter in the shadow data for A, since child dotdot
+entries affect the parent's link count.
+Live update hooks are carefully placed in all parts of the filesystem that
+create, change, or remove directory entries, since those operations involve
+bumplink and droplink.
+
+For any file, the correct link count is the number of parents plus the number
+of child subdirectories.
+Non-directories never have children of any kind.
+The backref information is used to detect inconsistencies in the number of
+links pointing to child subdirectories and the number of dotdot entries
+pointing back.
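+
+The counting rules can be summarized with a hypothetical shadow record; the
+real structure differs, but the arithmetic is the same::
+
+        struct xchk_nlink {
+                uint32_t        parents;        /* dirents pointing here */
+                uint32_t        backrefs;       /* child dotdots pointing here */
+                uint32_t        children;       /* subdirectory dirents here */
+        };
+
+        /* The correct link count is parents plus child subdirectories. */
+        static inline uint32_t
+        xchk_nlink_expected(const struct xchk_nlink *n)
+        {
+                return n->parents + n->children;
+        }
+
+        /* Each child subdirectory should contribute exactly one backref. */
+        static inline bool
+        xchk_nlink_consistent(const struct xchk_nlink *n)
+        {
+                return n->backrefs == n->children;
+        }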
+
+After the scan completes, the link count of each file can be checked by locking
+both the inode and the shadow data, and comparing the link counts.
+A second coordinated inode scan cursor is used for comparisons.
+Live updates are key to being able to walk every inode without needing to hold
+any locks between inodes.
+If repairs are desired, the inode's link count is set to the value in the
+shadow information.
+If no parents are found, the file must be :ref:`reparented <orphanage>` to the
+orphanage to prevent the file from being lost forever.
+
+The proposed patchset is the
+`file link count repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
+series.
+
+.. _rmap_repair:
+
+Case Study: Rebuilding Reverse Mapping Records
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most repair functions follow the same pattern: lock filesystem resources,
+walk the surviving ondisk metadata looking for replacement metadata records,
+and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
+The primary advantage of this approach is the simplicity and modularity of the
+repair code -- code and data are entirely contained within the scrub module,
+do not require hooks in the main filesystem, and are usually the most efficient
+in memory use.
+A secondary advantage of this repair approach is atomicity -- once the kernel
+decides a structure is corrupt, no other threads can access the metadata until
+the kernel finishes repairing and revalidating the metadata.
+
+For repairs going on within a shard of the filesystem, these advantages
+outweigh the delays inherent in locking the shard while repairing parts of the
+shard.
+Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
+btree repair strategy because it must scan every space mapping of every fork of
+every file in the filesystem, and the filesystem cannot stop.
+Therefore, rmap repair foregoes atomicity between scrub and repair.
+It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
+<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
+scan for reverse mapping records.
+
+1. Set up an xfbtree to stage rmap records.
+
+2. While holding the locks on the AGI and AGF buffers acquired during the
+ scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
+ staging extents, and the internal log.
+
+3. Set up an inode scanner.
+
+4. Hook into rmap updates for the AG being repaired so that the live scan data
+ can receive updates to the rmap btree from the rest of the filesystem during
+ the file scan.
+
+5. For each space mapping found in either fork of each file scanned,
+ decide if the mapping matches the AG of interest.
+ If so:
+
+ a. Create a btree cursor for the in-memory btree.
+
+ b. Use the rmap code to add the record to the in-memory btree.
+
+ c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+ xfbtree changes to the xfile.
+
+6. For each live update received via the hook, decide if the owner has already
+ been scanned.
+ If so, apply the live update into the scan data:
+
+ a. Create a btree cursor for the in-memory btree.
+
+ b. Replay the operation into the in-memory btree.
+
+ c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+ xfbtree changes to the xfile.
+ This is performed with an empty transaction to avoid changing the
+ caller's state.
+
+7. When the inode scan finishes, create a new scrub transaction and relock the
+ two AG headers.
+
+8. Compute the new btree geometry using the number of rmap records in the
+ shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+ btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+ to :ref:`reap after rmap btree repair <rmap_reap>`.
+
+12. Free the xfbtree now that it is no longer needed.
+
+The proposed patchset is the
+`rmap repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
+series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+Because file forks can consume as much space as the entire filesystem, repairs
+cannot be staged in memory, even when a paging scheme is available.
+Therefore, online repair of file-based metadata creates a temporary file in
+the XFS filesystem, writes a new structure at the correct offsets into the
+temporary file, and atomically swaps the fork mappings (and hence the fork
+contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
+
+Swapping metadata extents with a temporary file requires the owner field of
+the block headers to match the file being repaired and not the temporary
+file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
+
+There is a downside to the reaping process -- if the system crashes during the
+reap phase and the fork extents are crosslinked, the iunlink processing will
+fail because freeing space will find the extra reverse mappings and abort.
+
+Temporary files created for repair are similar to ``O_TMPFILE`` files created
+by userspace.
+They are not linked into a directory and the entire file will be reaped when
+the last reference to the file is lost.
+The key differences are that these files must have no access permission outside
+the kernel at all, they must be specially marked to prevent them from being
+opened by handle, and they must never be linked into the directory tree.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| In the initial iteration of file metadata repair, the damaged metadata |
+| blocks would be scanned for salvageable data; the extents in the file |
+| fork would be reaped; and then a new structure would be built in its |
+| place. |
+| This strategy did not survive the introduction of the atomic repair |
+| requirement expressed earlier in this document. |
+| |
+| The second iteration explored building a second structure at a high |
+| offset in the fork from the salvage data, reaping the old extents, and |
+| using a ``COLLAPSE_RANGE`` operation to slide the new extents into |
+| place. |
+| |
+| This had many drawbacks: |
+| |
+| - Array structures are linearly addressed, and the regular filesystem |
+| codebase does not have the concept of a linear offset that could be |
+| applied to the record offset computation to build an alternate copy. |
+| |
+| - Extended attributes are allowed to use the entire attr fork offset |
+| address space. |
+| |
+| - Even if repair could build an alternate copy of a data structure in a |
+| different part of the fork address space, the atomic repair commit |
+| requirement means that online repair would have to be able to perform |
+| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old |
+| structure was completely replaced. |
+| |
+| - A crash after construction of the secondary tree but before the range |
+| collapse would leave unreachable blocks in the file fork. |
+| This would likely confuse things further. |
+| |
+| - Reaping blocks after a repair is not a simple operation, and |
+| initiating a reap operation from a restarted range collapse operation |
+| during log recovery is daunting. |
+| |
+| - Directory entry blocks and quota records record the file fork offset |
+| in the header area of each block. |
+| An atomic range collapse operation would have to rewrite this part of |
+| each block header. |
+| Rewriting a single field in block headers is not a huge problem, but |
+| it's something to be aware of. |
+| |
+| - Each block in a directory or extended attributes btree index contains |
+| sibling and child block pointers. |
+| Were the atomic commit to use a range collapse operation, each block |
+| would have to be rewritten very carefully to preserve the graph |
+| structure. |
+| Doing this as part of a range collapse means rewriting a large number |
+| of blocks repeatedly, which is not conducive to quick repairs. |
+| |
+| This led to the introduction of temporary file staging.                  |
++--------------------------------------------------------------------------+
+
+Using a Temporary File
+``````````````````````
+
+Online repair code should use the ``xrep_tempfile_create`` function to create a
+temporary file inside the filesystem.
+This allocates an inode, marks the in-core inode private, and attaches it to
+the scrub context.
+These files are hidden from userspace, may not be added to the directory tree,
+and must be kept private.
+
+Temporary files only use two inode locks: the IOLOCK and the ILOCK.
+The MMAPLOCK is not needed here, because there must not be page faults from
+userspace for data fork blocks.
+The usage patterns of these two locks are the same as for any other XFS file --
+access to file data is controlled via the IOLOCK, and access to file metadata
+is controlled via the ILOCK.
+Locking helpers are provided so that the temporary file and its lock state can
+be cleaned up by the scrub context.
+To comply with the nested locking strategy laid out in the :ref:`inode
+locking<ilocking>` section, it is recommended that scrub functions use the
+``xrep_tempfile_ilock*_nowait`` lock helpers.
+
+Data can be written to a temporary file by two means:
+
+1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
+ temporary file from an xfile.
+
+2. The regular directory, symbolic link, and extended attribute functions can
+ be used to write to the temporary file.
+
+Once a good copy of a data file has been constructed in a temporary file, it
+must be conveyed to the file being repaired, which is the topic of the next
+section.
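+
+Putting these pieces together, a repair function might stage its new contents
+as shown in this minimal sketch.
+The helper names are taken from the discussion above; the exact signatures,
+the mode argument, and the error handling are assumptions.
+
+.. code-block:: c
+
+    /* Sketch only; signatures are assumed from the text above. */
+    STATIC int
+    xrep_stage_new_contents(
+            struct xfs_scrub        *sc,
+            struct xfile            *contents,
+            loff_t                  isize)
+    {
+            int                     error;
+
+            /* Create a private, unlinked temporary file in this fs. */
+            error = xrep_tempfile_create(sc, S_IFREG);
+            if (error)
+                    return error;
+
+            /* Take the IOLOCK and ILOCK without inverting lock order. */
+            if (!xrep_tempfile_ilock_nowait(sc))
+                    return -EDEADLOCK;
+
+            /* Copy the staged contents from the xfile into the fork. */
+            return xrep_tempfile_copyin(sc, contents, 0, isize);
+    }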
+
+The proposed patches are in the
+`repair temporary files
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
+series.
+
+Atomic Extent Swapping
+----------------------
+
+Once repair builds a temporary file with a new data structure written into
+it, it must commit the new changes into the existing file.
+It is not possible to swap the inumbers of two files, so instead the new
+metadata must replace the old.
+This suggests the need for the ability to swap extents, but the existing extent
+swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
+for online repair because:
+
+a. When the reverse-mapping btree is enabled, the swap code must keep the
+ reverse mapping information up to date with every exchange of mappings.
+ Therefore, it can only exchange one mapping per transaction, and each
+ transaction is independent.
+
+b. Reverse-mapping is critical for the operation of online fsck, so the old
+ defragmentation code (which swapped entire extent forks in a single
+ operation) is not useful here.
+
+c. Defragmentation is assumed to occur between two files with identical
+ contents.
+ For this use case, an incomplete exchange will not result in a user-visible
+ change in file contents, even if the operation is interrupted.
+
+d. Online repair needs to swap the contents of two files that are by definition
+ *not* identical.
+ For directory and xattr repairs, the user-visible contents might be the
+ same, but the contents of individual blocks may be very different.
+
+e. Old blocks in the file may be cross-linked with another structure and must
+ not reappear if the system goes down mid-repair.
+
+These problems are overcome by creating a new deferred operation and a new type
+of log intent item to track the progress of an operation to exchange two file
+ranges.
+The new deferred operation type chains together the same transactions used by
+the reverse-mapping extent swap code.
+The new log item records the progress of the exchange to ensure that once an
+exchange begins, it will always run to completion, even if there are
+interruptions.
+The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+in the superblock protects these new log item records from being replayed on
+old kernels.
+
+The proposed patchset is the
+`atomic extent swap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
+series.
+
++--------------------------------------------------------------------------+
+| **Sidebar: Using Log-Incompatible Feature Flags** |
++--------------------------------------------------------------------------+
+| Starting with XFS v5, the superblock contains a |
+| ``sb_features_log_incompat`` field to indicate that the log contains |
+| records that might not be readable by all kernels that could mount this  |
+| filesystem. |
+| In short, log incompat features protect the log contents against kernels |
+| that will not understand the contents. |
+| Unlike the other superblock feature bits, log incompat bits are |
+| ephemeral because an empty (clean) log does not need protection. |
+| The log cleans itself after its contents have been committed into the |
+| filesystem, either as part of an unmount or because the system is |
+| otherwise idle. |
+| Because upper level code can be working on a transaction at the same |
+| time that the log cleans itself, it is necessary for upper level code to |
+| communicate to the log when it is going to use a log incompatible |
+| feature. |
+| |
+| The log coordinates access to incompatible features through the use of |
+| one ``struct rw_semaphore`` for each feature. |
+| The log cleaning code tries to take this rwsem in exclusive mode to |
+| clear the bit; if the lock attempt fails, the feature bit remains set. |
+| Filesystem code signals its intention to use a log incompat feature in a |
+| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
+| in shared mode. |
+| The code supporting a log incompat feature should create wrapper |
+| functions to obtain the log feature and call |
+| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
+| superblock. |
+| The superblock update is performed transactionally, so the wrapper to |
+| obtain log assistance must be called just prior to the creation of the |
+| transaction that uses the functionality. |
+| For a file operation, this step must happen after taking the IOLOCK |
+| and the MMAPLOCK, but before allocating the transaction. |
+| When the transaction is complete, the ``xlog_drop_incompat_feat`` |
+| function is called to release the feature. |
+| The feature bit will not be cleared from the superblock until the log |
+| becomes clean. |
+| |
+| Log-assisted extended attribute updates and atomic extent swaps both use |
+| log incompat features and provide convenience wrappers around the |
+| functionality. |
++--------------------------------------------------------------------------+
+
+Mechanics of an Atomic Extent Swap
+``````````````````````````````````
+
+Swapping entire file forks is a complex task.
+The goal is to exchange all file fork mappings between two file fork offset
+ranges.
+There are likely to be many extent mappings in each fork, and the edges of
+the mappings aren't necessarily aligned.
+Furthermore, there may be other updates that need to happen after the swap,
+such as exchanging file sizes, inode flags, or conversion of fork data to local
+format.
+This is roughly the format of the new deferred extent swap work item:
+
+.. code-block:: c
+
+ struct xfs_swapext_intent {
+ /* Inodes participating in the operation. */
+ struct xfs_inode *sxi_ip1;
+ struct xfs_inode *sxi_ip2;
+
+ /* File offset range information. */
+ xfs_fileoff_t sxi_startoff1;
+ xfs_fileoff_t sxi_startoff2;
+ xfs_filblks_t sxi_blockcount;
+
+ /* Set these file sizes after the operation, unless negative. */
+ xfs_fsize_t sxi_isize1;
+ xfs_fsize_t sxi_isize2;
+
+ /* XFS_SWAP_EXT_* log operation flags */
+ uint64_t sxi_flags;
+ };
+
+The new log intent item contains enough information to track two logical fork
+offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
+blockcount)``.
+Each step of a swap operation exchanges the largest file range mapping possible
+from one file to the other.
+After each step in the swap operation, the two startoff fields are incremented
+and the blockcount field is decremented to reflect the progress made.
+The flags field captures behavioral parameters such as swapping the attr fork
+instead of the data fork and other work to be done after the extent swap.
+The two isize fields are used to swap the file size at the end of the operation
+if the file data fork is the target of the swap operation.
+
+When the extent swap is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the extent swap.
+ At the start, it should contain the entirety of the file ranges to be
+ swapped.
+
+2. Call ``xfs_defer_finish`` to process the exchange.
+ This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+ This will log an extent swap intent item to the transaction for the deferred
+ extent swap work item.
+
+3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+
+ a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
+ ``sxi_startoff2``, respectively, and compute the longest extent that can
+ be swapped in a single step.
+      This is the minimum of the two ``br_blockcount`` values in the mappings.
+ Keep advancing through the file forks until at least one of the mappings
+ contains written blocks.
+ Mutual holes, unwritten extents, and extent mappings to the same physical
+ space are not exchanged.
+
+ For the next few steps, this document will refer to the mapping that came
+ from file 1 as "map1", and the mapping that came from file 2 as "map2".
+
+ b. Create a deferred block mapping update to unmap map1 from file 1.
+
+ c. Create a deferred block mapping update to unmap map2 from file 2.
+
+ d. Create a deferred block mapping update to map map1 into file 2.
+
+ e. Create a deferred block mapping update to map map2 into file 1.
+
+ f. Log the block, quota, and extent count updates for both files.
+
+ g. Extend the ondisk size of either file if necessary.
+
+ h. Log an extent swap done log item for the extent swap intent log item
+ that was read at the start of step 3.
+
+ i. Compute the amount of file range that has just been covered.
+ This quantity is ``(map1.br_startoff + map1.br_blockcount -
+ sxi_startoff1)``, because step 3a could have skipped holes.
+
+ j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+ by the number of blocks computed in the previous step, and decrease
+ ``sxi_blockcount`` by the same quantity.
+ This advances the cursor.
+
+ k. Log a new extent swap intent log item reflecting the advanced state of
+ the work item.
+
+ l. Return the proper error code (EAGAIN) to the deferred operation manager
+ to inform it that there is more work to be done.
+ The operation manager completes the deferred work in steps 3b-3e before
+ moving back to the start of step 3.
+
+4. Perform any post-processing.
+ This will be discussed in more detail in subsequent sections.
+
+If the filesystem goes down in the middle of an operation, log recovery will
+find the most recent unfinished extent swap log intent item and restart from
+there.
+This is how extent swapping guarantees that an outside observer will either see
+the old broken structure or the new one, and never a mishmash of both.
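+
+To make steps 3i and 3j concrete, the cursor advance might look like this
+sketch, reusing the intent structure shown earlier and the kernel's
+``struct xfs_bmbt_irec`` mapping type:
+
+.. code-block:: c
+
+    /*
+     * Sketch of steps 3i-3j: advance the work item cursor past the
+     * file range that was just exchanged.  map1 is the mapping taken
+     * from file 1 in step 3a; holes skipped there are counted too.
+     */
+    static void
+    xfs_swapext_advance(
+            struct xfs_swapext_intent       *sxi,
+            const struct xfs_bmbt_irec      *map1)
+    {
+            xfs_filblks_t                   advance;
+
+            advance = map1->br_startoff + map1->br_blockcount -
+                      sxi->sxi_startoff1;
+
+            sxi->sxi_startoff1 += advance;
+            sxi->sxi_startoff2 += advance;
+            sxi->sxi_blockcount -= advance;
+    }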
+
+Preparation for Extent Swapping
+```````````````````````````````
+
+There are a few things that need to be taken care of before initiating an
+atomic extent swap operation.
+First, regular files require the page cache to be flushed to disk before the
+operation begins, and directio writes to be quiesced.
+Like any filesystem operation, extent swapping must determine the maximum
+amount of disk space and quota that can be consumed on behalf of both files in
+the operation, and reserve that quantity of resources to avoid an unrecoverable
+out of space failure once it starts dirtying metadata.
+The preparation step scans the ranges of both files to estimate:
+
+- Data device blocks needed to handle the repeated updates to the fork
+ mappings.
+- Change in data and realtime block counts for both files.
+- Increase in quota usage for both files, if the two files do not share the
+ same set of quota ids.
+- The number of extent mappings that will be added to each file.
+- Whether or not there are partially written realtime extents.
+ User programs must never be able to access a realtime file extent that maps
+ to different extents on the realtime volume, which could happen if the
+ operation fails to run to completion.
+
+The need for precise estimation increases the run time of the swap operation,
+but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the extent
+swap ever add more extent mappings to a fork than it can support.
+Regular users are required to abide by the quota limits, though metadata
+repairs may exceed quota to resolve inconsistent metadata elsewhere.
+
+Special Features for Swapping Metadata File Extents
+```````````````````````````````````````````````````
+
+Extended attributes, symbolic links, and directories can set the fork format to
+"local" and treat the fork as a literal area for data storage.
+Metadata repairs must take extra steps to support these cases:
+
+- If both forks are in local format and the fork areas are large enough, the
+ swap is performed by copying the incore fork contents, logging both forks,
+ and committing.
+ The atomic extent swap mechanism is not necessary, since this can be done
+ with a single transaction.
+
+- If both forks map blocks, then the regular atomic extent swap is used.
+
+- Otherwise, only one fork is in local format.
+ The contents of the local format fork are converted to a block to perform the
+ swap.
+ The conversion to block format must be done in the same transaction that
+ logs the initial extent swap intent log item.
+ The regular atomic extent swap is used to exchange the mappings.
+ Special flags are set on the swap operation so that the transaction can be
+ rolled one more time to convert the second file's fork back to local format
+ so that the second file will be ready to go as soon as the ILOCK is dropped.
+
+Extended attributes and directories stamp the owning inode into every block,
+but the buffer verifiers do not actually check the inode number!
+Although there is no verification, it is still important to maintain
+referential integrity, so prior to performing the extent swap, online repair
+builds every block in the new data structure with the owner field of the file
+being repaired.
+
+After a successful swap operation, the repair operation must reap the old fork
+blocks by processing each fork mapping through the standard :ref:`file extent
+reaping <reaping>` mechanism that is done post-repair.
+If the filesystem should go down during the reap part of the repair, the
+iunlink processing at the end of recovery will free both the temporary file and
+whatever blocks were not reaped.
+However, this iunlink processing omits the cross-link detection of online
+repair, and is not completely foolproof.
+
+Swapping Temporary File Extents
+```````````````````````````````
+
+To repair a metadata file, online repair proceeds as follows:
+
+1. Create a temporary repair file.
+
+2. Use the staging data to write out new contents into the temporary repair
+ file.
+ The same fork must be written to as is being repaired.
+
+3. Commit the scrub transaction, since the swap estimation step must be
+ completed before transaction reservations are made.
+
+4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+   the appropriate resource reservations and locks, and fill out a ``struct
+   xfs_swapext_req`` with the details of the swap operation.
+
+5. Call ``xrep_tempswap_contents`` to swap the contents.
+
+6. Commit the transaction to complete the repair.
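+
+Steps 4 through 6 might look like the following sketch.
+The helpers are named in the steps above, but the exact signatures and the
+error handling are assumptions.
+
+.. code-block:: c
+
+    /* Sketch of committing repaired contents into the target file. */
+    STATIC int
+    xrep_commit_new_contents(
+            struct xfs_scrub        *sc)
+    {
+            struct xfs_swapext_req  req;
+            int                     error;
+
+            /* Allocate a transaction, reservations, and locks. */
+            error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
+            if (error)
+                    return error;
+
+            /* Exchange the fork mappings of the two files. */
+            error = xrep_tempswap_contents(sc, &req);
+            if (error)
+                    return error;
+
+            /* Commit the repair. */
+            return xfs_trans_commit(sc->tp);
+    }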
+
+.. _rtsummary:
+
+Case Study: Repairing the Realtime Summary File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the "realtime" section of an XFS filesystem, free space is tracked via a
+bitmap, similar to Unix FFS.
+Each bit in the bitmap represents one realtime extent, which is a multiple of
+the filesystem block size between 4KiB and 1GiB in size.
+The realtime summary file indexes the number of free extents of a given size to
+the offset of the block within the realtime free space bitmap where those free
+extents begin.
+In other words, the summary file helps the allocator find free extents by
+length, similar to what the free space by count (cntbt) btree does for the data
+section.
+
+The summary file itself is a flat file (with no block headers or checksums!)
+partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
+counters to match the number of blocks in the rt bitmap.
+Each counter records the number of free extents that start in that bitmap block
+and can satisfy a power-of-two allocation request.
+
+To check the summary file against the bitmap:
+
+1. Take the ILOCK of both the realtime bitmap and summary files.
+
+2. For each free space extent recorded in the bitmap:
+
+ a. Compute the position in the summary file that contains a counter that
+ represents this free extent.
+
+ b. Read the counter from the xfile.
+
+ c. Increment it, and write it back to the xfile.
+
+3. Compare the contents of the xfile against the ondisk file.
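+
+The counter position in step 2a can be derived from the log2 of the free
+extent length and the bitmap block in which the free extent starts.
+A sketch of the computation, assuming the flat layout described above:
+
+.. code-block:: c
+
+    /*
+     * Sketch of computing the index of the summary counter covering a
+     * free extent.  log2_len is the log2 of the extent length in rt
+     * extents; bbno is the bitmap block where the free extent starts;
+     * rbmblocks is the size of the realtime bitmap file in blocks.
+     */
+    static inline uint64_t
+    rtsummary_counter_index(
+            unsigned int            log2_len,
+            uint64_t                bbno,
+            uint64_t                rbmblocks)
+    {
+            /* One 32-bit counter per bitmap block per size class. */
+            return (uint64_t)log2_len * rbmblocks + bbno;
+    }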
+
+To repair the summary file, write the xfile contents into the temporary file
+and use atomic extent swap to commit the new contents.
+The temporary file is then reaped.
+
+The proposed patchset is the
+`realtime summary repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
+series.
+
+Case Study: Salvaging Extended Attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In XFS, extended attributes are implemented as a namespaced name-value store.
+Values are limited in size to 64KiB, but there is no limit on the number of
+names.
+The attribute fork is unpartitioned, which means that the root of the attribute
+structure is always in logical block zero, but attribute leaf blocks, dabtree
+index blocks, and remote value blocks are intermixed.
+Attribute leaf blocks contain variable-sized records that associate
+user-provided names with the user-provided values.
+Values larger than a block are allocated separate extents and written there.
+If the leaf information expands beyond a single block, a directory/attribute
+btree (``dabtree``) is created to map hashes of attribute names to entries
+for fast lookup.
+
+Salvaging extended attributes is done as follows:
+
+1. Walk the attr fork mappings of the file being repaired to find the attribute
+ leaf blocks.
+ When one is found,
+
+ a. Walk the attr leaf block to find candidate keys.
+ When one is found,
+
+      1. Check the name for problems, and ignore the name if there are any.
+
+ 2. Retrieve the value.
+ If that succeeds, add the name and value to the staging xfarray and
+ xfblob.
+
+2. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+ memory or there are no more attr fork blocks to examine, unlock the file and
+ add the staged extended attributes to the temporary file.
+
+3. Use atomic extent swapping to exchange the new and old extended attribute
+ structures.
+ The old attribute blocks are now attached to the temporary file.
+
+4. Reap the temporary file.
+
+The proposed patchset is the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
+series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features,
+since directory entries are not redundant.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
+and moving orphans to the ``/lost+found`` directory.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged
+ parent has a child entry pointing back to the directory being repaired.
+ Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+ When one is found,
+
+ a. Walk the directory data block to find candidate entries.
+ When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+ ii. Retrieve the inumber and grab the inode.
+ If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+ memory or there are no more directory data blocks to examine, unlock the
+ directory and add the staged dirents into the temporary directory.
+ Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+ The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
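+
+The staging in step 2.a.ii might look like the following sketch.
+The record layout, the context structure, and the helper names are
+assumptions built on the xfarray and xfblob facilities described earlier.
+
+.. code-block:: c
+
+    /* Hypothetical staging record for one salvaged directory entry. */
+    struct xrep_dirent {
+            xfblob_cookie           name_cookie;    /* name in xfblob */
+            xfs_ino_t               ino;            /* child inode */
+            uint8_t                 namelen;        /* name length */
+            uint8_t                 ftype;          /* file type */
+    };
+
+    /* Sketch of stashing one candidate entry for later replay. */
+    STATIC int
+    xrep_dir_stash_dirent(
+            struct xrep_dir         *rd,    /* hypothetical context */
+            const unsigned char     *name,
+            uint8_t                 namelen,
+            xfs_ino_t               ino,
+            uint8_t                 ftype)
+    {
+            struct xrep_dirent      ent = {
+                    .ino            = ino,
+                    .namelen        = namelen,
+                    .ftype          = ftype,
+            };
+            int                     error;
+
+            /* Store the name; the cookie retrieves it at replay time. */
+            error = xfblob_store(rd->dir_names, &ent.name_cookie,
+                            name, namelen);
+            if (error)
+                    return error;
+
+            /* Append the fixed-size record to the xfarray. */
+            return xfarray_append(rd->dir_entries, &ent);
+    }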
+
+**Future Work Question**: Should repair revalidate the dentry cache when
+rebuilding a directory?
+
+*Answer*: Yes, it should.
+
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following apply:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+ directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+ purged.
+ This is the problem case.
+
+Unfortunately, the current dentry cache design doesn't provide a means to walk
+every child dentry of a specific directory, which makes this a hard problem.
+There is no known solution.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+Parent Pointers
+```````````````
+
+A parent pointer is a piece of file metadata that enables a user to locate the
+file's parent directory without having to traverse the directory tree from the
+root.
+Without them, reconstruction of directory trees is hindered in much the same
+way that the historic lack of reverse space mapping information once hindered
+reconstruction of filesystem space metadata.
+The parent pointer feature, however, makes total directory reconstruction
+possible.
+
+XFS parent pointers include the dirent name and location of the entry within
+the parent directory.
+In other words, child files use extended attributes to store pointers to
+parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+The directory checking process can be strengthened to ensure that the target of
+each dirent also contains a parent pointer pointing back to the dirent.
+Likewise, each parent pointer can be checked by ensuring that the target of
+each parent pointer is a directory and that it contains a dirent matching
+the parent pointer.
+Both online and offline repair can use this strategy.
+
+**Note**: The ondisk format of parent pointers is not yet finalized.
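+
+For illustration only, a parent pointer might carry the following
+information; field names and sizes here are assumptions pending the format
+discussion.
+
+.. code-block:: c
+
+    /* Purely illustrative; the ondisk format is not finalized. */
+    struct xfs_parent_pointer {
+            uint64_t        p_ino;          /* parent inode number */
+            uint32_t        p_gen;          /* parent inode generation */
+            uint32_t        p_diroffset;    /* dirent position in parent */
+            /* The xattr value stores the dirent name itself. */
+    };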
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| Directory parent pointers were first proposed as an XFS feature more |
+| than a decade ago by SGI. |
+| Each link from a parent directory to a child file is mirrored with an |
+| extended attribute in the child that could be used to identify the |
+| parent directory. |
+| Unfortunately, this early implementation had major shortcomings and was |
+| never merged into Linux XFS: |
+| |
+| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
+| enforce strong referential integrity in the directory tree. |
+| It did not guarantee that a change in a forward link would always be |
+| followed up with the corresponding change to the reverse links. |
+| |
+| 2. Referential integrity was not integrated into offline repair. |
+| Checking and repairs were performed on mounted filesystems without |
+| taking any kernel or inode locks to coordinate access. |
+| It is not clear how this actually worked properly. |
+| |
+| 3. The extended attribute did not record the name of the directory entry |
+| in the parent, so the SGI parent pointer implementation cannot be |
+| used to reconnect the directory tree. |
+| |
+| 4. Extended attribute forks only support 65,536 extents, which means |
+| that parent pointer attribute creation is likely to fail at some |
+| point before the maximum file link count is achieved. |
+| |
+| The original parent pointer design was too unstable for something like |
+| a file system repair to depend on. |
+| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a |
+| second implementation that solves all shortcomings of the first. |
+| During 2022, Allison introduced log intent items to track physical |
+| manipulations of the extended attribute structures. |
+| This solves the referential integrity problem by making it possible to |
+| commit a dirent update and a parent pointer update in the same |
+| transaction. |
+| Chandan increased the maximum extent counts of both data and attribute |
+| forks, thereby ensuring that the extended attribute structure can grow |
+| to handle the maximum hardlink count of any file. |
++--------------------------------------------------------------------------+
+
+Case Study: Repairing Directories with Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>` as follows:
+
+1. Set up a temporary directory for generating the new directory structure,
+ an xfblob for storing entry names, and an xfarray for stashing directory
+ updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+ updates on directory operations.
+
+3. For each parent pointer found in each file scanned, decide if the parent
+ pointer references the directory of interest.
+ If so:
+
+ a. Stash an addname entry for this dirent in the xfarray for later.
+
+ b. When finished scanning that file, flush the stashed updates to the
+ temporary directory.
+
+4. For each live directory update received via the hook, decide if the child
+ has already been scanned.
+ If so:
+
+ a. Stash an addname or removename entry for this dirent update in the
+ xfarray for later.
+ We cannot write directly to the temporary directory because hook
+ functions are not allowed to modify filesystem metadata.
+ Instead, we stash updates in the xfarray and rely on the scanner thread
+ to apply the stashed updates to the temporary directory.
+
+5. When the scan is complete, atomically swap the contents of the temporary
+ directory and the directory being repaired.
+ The temporary directory now contains the damaged directory structure.
+
+6. Reap the temporary directory.
+
+7. Update the dirent position field of parent pointers as necessary.
+ This may require the queuing of a substantial number of xattr log intent
+ items.
+
+The proposed patchset is the
+`parent pointers directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
+series.
+
+**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
+match in the reconstructed directory?
+
+*Answer*: There are a few ways to solve this problem:
+
+1. The field could be designated advisory, since the other three values are
+ sufficient to find the entry in the parent.
+ However, this makes indexed key lookup impossible while repairs are ongoing.
+
+2. We could allow creating directory entries at specified offsets, which solves
+ the referential integrity problem but runs the risk that dirent creation
+ will fail due to conflicts with the free space in the directory.
+
+ These conflicts could be resolved by appending the directory entry and
+ amending the xattr code to support updating an xattr key and reindexing the
+ dabtree, though this would have to be performed with the parent directory
+ still locked.
+
+3. Same as above, but remove the old parent pointer entry and add a new one
+ atomically.
+
+4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
+ which would provide the attr name uniqueness that we require, without
+ forcing repair code to update the dirent position.
+ Unfortunately, this requires changes to the xattr code to support attr
+ names as long as 263 bytes.
+
+5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
+ (name, parent_gen)``.
+ If the hash is sufficiently resistant to collisions (e.g. sha256) then
+ this should provide the attr name uniqueness that we require.
+ Names shorter than 247 bytes could be stored directly.
+
+Discussion is ongoing under the `parent pointers patch deluge
+<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
+
+Case Study: Repairing Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Online reconstruction of a file's parent pointer information works similarly to
+directory reconstruction:
+
+1. Set up a temporary file for generating a new extended attribute structure,
+   an :ref:`xfblob<xfblob>` for storing parent pointer names, and an xfarray
+ stashing parent pointer updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+ updates on directory operations.
+
+3. For each directory entry found in each directory scanned, decide if the
+ dirent references the file of interest.
+ If so:
+
+ a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
+ for later.
+
+ b. When finished scanning the directory, flush the stashed updates to the
+ temporary directory.
+
+4. For each live directory update received via the hook, decide if the parent
+ has already been scanned.
+ If so:
+
+ a. Stash an addpptr or removepptr entry for this dirent update in the
+ xfarray for later.
+ We cannot write parent pointers directly to the temporary file because
+ hook functions are not allowed to modify filesystem metadata.
+ Instead, we stash updates in the xfarray and rely on the scanner thread
+ to apply the stashed parent pointer updates to the temporary file.
+
+5. Copy all non-parent pointer extended attributes to the temporary file.
+
+6. When the scan is complete, atomically swap the attribute fork of the
+ temporary file and the file being repaired.
+ The temporary file now contains the damaged extended attribute structure.
+
+7. Reap the temporary file.
+
+The proposed patchset is the
+`parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
+series.
+
+Digression: Offline Checking of Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (i.e. phase 6),
+ walk the surviving directories of each AG in the filesystem.
+ This is already performed as part of the connectivity checks.
+
+2. For each directory entry found, record the name in an xfblob, and store
+ ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
+ per-AG in-memory slab.
+
+3. For each AG in the filesystem,
+
+ a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
+ dirent_pos.
+
+ b. For each inode in the AG,
+
+ 1. Scan the inode for parent pointers.
+ Record the names in a per-file xfblob, and store ``(parent_inum,
+ parent_gen, dirent_pos)`` tuples in a per-file slab.
+
+      2. Sort the per-file tuples in order of parent_inum and dirent_pos.
+
+ 3. Position one slab cursor at the start of the inode's records in the
+ per-AG tuple slab.
+ This should be trivial since the per-AG tuples are in child inumber
+ order.
+
+ 4. Position a second slab cursor at the start of the per-file tuple slab.
+
+      5. Iterate the two cursors in lockstep, comparing the parent_inum and
+         dirent_pos fields of the records under each cursor.
+
+ a. Tuples in the per-AG list but not the per-file list are missing and
+ need to be written to the inode.
+
+ b. Tuples in the per-file list but not the per-AG list are dangling
+ and need to be removed from the inode.
+
+ c. For tuples in both lists, update the parent_gen and name components
+ of the parent pointer if necessary.
+
+4. Move on to examining link counts, as we do today.
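+
+The lockstep iteration in step 3.b.5 is a standard merge of two identically
+sorted lists.
+A sketch, assuming hypothetical cursor helpers and a comparison function
+that sorts exhausted (NULL) cursors last:
+
+.. code-block:: c
+
+    /* Sketch of the lockstep comparison in step 3.b.5. */
+    static void
+    compare_pptr_lists(
+            struct xfs_inode        *ip,
+            struct slab_cursor      *ag_cur,
+            struct slab_cursor      *file_cur)
+    {
+            struct pptr_rec         *a = slab_cursor_next(ag_cur);
+            struct pptr_rec         *f = slab_cursor_next(file_cur);
+
+            while (a || f) {
+                    int cmp = pptr_cmp(a, f);       /* NULL sorts last */
+
+                    if (cmp < 0) {
+                            /* 5a: missing; write it to the inode. */
+                            add_missing_pptr(ip, a);
+                            a = slab_cursor_next(ag_cur);
+                    } else if (cmp > 0) {
+                            /* 5b: dangling; remove it from the inode. */
+                            remove_dangling_pptr(ip, f);
+                            f = slab_cursor_next(file_cur);
+                    } else {
+                            /* 5c: fix parent_gen and name if needed. */
+                            update_pptr_if_needed(ip, a, f);
+                            a = slab_cursor_next(ag_cur);
+                            f = slab_cursor_next(file_cur);
+                    }
+            }
+    }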
+
+The proposed patchset is the
+`offline parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
+series.
+
+Rebuilding directories from parent pointers in offline repair is very
+challenging because it currently uses a single-pass scan of the filesystem
+during phase 3 to decide which files are corrupt enough to be zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+ much as it does now.
+ Corrupt directories are noted but not zapped.
+
+2. The next pass records parent pointers pointing to the directories noted
+ as being corrupt in the first pass.
+ This second pass may have to happen after the phase 4 scan for duplicate
+ blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+ Free space metadata has not been ensured yet, so repair cannot yet use the
+ directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+ Use the parent pointer information recorded during step 2 to reconstruct
+ the dirents and add them to the now-empty directories.
+
+This code has not yet been constructed.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+
+Without parent pointers, the directory parent pointer online scrub code can
+detect a dotdot entry pointing to a parent directory that doesn't have a link
+back to the child directory and the file link count checker can detect a file
+that isn't pointed to by any directory in the filesystem.
+If such a file has a positive link count, the file is an orphan.
+
+With parent pointers, directories can be rebuilt by scanning parent pointers
+and parent pointers can be rebuilt by scanning directories.
+This should reduce the incidence of files ending up in ``/lost+found``.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+ to try to ensure that the lost and found directory actually exists.
+ This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+ orphanage and the file being reattached.
+ The ``xrep_orphanage_iolock_two`` function follows the inode locking
+ strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+ to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+ transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+ and found, and update the kernel dentry cache.
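+
+Chaining the helpers named above, the adoption sequence might look like this
+sketch; the exact signatures are assumptions.
+
+.. code-block:: c
+
+    /* Sketch of moving an orphaned file into the lost and found. */
+    STATIC int
+    xrep_adopt_orphan(
+            struct xfs_scrub        *sc)
+    {
+            int                     error;
+
+            /* Lock the orphanage and the orphan in the proper order. */
+            error = xrep_orphanage_iolock_two(sc);
+            if (error)
+                    return error;
+
+            /* Compute the new name and the block reservation needed. */
+            error = xrep_orphanage_compute_name(sc);
+            if (error)
+                    return error;
+            error = xrep_orphanage_compute_blkres(sc);
+            if (error)
+                    return error;
+
+            /* Reserve resources to the repair transaction. */
+            error = xrep_orphanage_adoption_prep(sc);
+            if (error)
+                    return error;
+
+            /* Reparent the file and update the dentry cache. */
+            return xrep_orphanage_adopt(sc);
+    }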
+
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.
+
+6. Userspace Algorithms and Data Structures
+===========================================
+
+This section discusses the key algorithms and data structures of the userspace
+program, ``xfs_scrub``, that provide the ability to drive metadata checks and
+repairs in the kernel, verify file data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
+That structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+ the allocation group space btrees, and the realtime volume space
+ information.
+
+b. Quota resource counts depend on consistency within the quota file data
+ forks, inode indices, inode records, and the forks of every file on the
+ system.
+
+c. The naming hierarchy depends on consistency within the directory and
+ extended attribute structures.
+ This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+ the file forks that map directory and extended attribute data to physical
+ storage media.
+
+e. The file forks depend on consistency within inode records and the space
+ metadata indices of the allocation groups and the realtime volume.
+ This includes quota and realtime metadata files.
+
+f. Inode records depend on consistency within the inode metadata indices.
+
+g. Realtime space metadata depend on the inode records and data forks of the
+ realtime metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+ and reverse mapping btrees) depend on consistency within the AG headers and
+ between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+ for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+  the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 scans inodes in parallel.
+ For each inode, groups (f), (e), and (d) are checked, in that order.
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+ may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+ to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+ to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
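+
+A sketch of the two-workqueue arrangement follows, loosely patterned on the
+libfrog workqueue API in xfsprogs; the exact function names and signatures
+should be treated as assumptions, and error checking is elided.
+
+.. code-block:: c
+
+    /* Sketch of the bounded two-workqueue inode scan. */
+    static void
+    start_inode_scan(
+            struct scrub_ctx        *ctx,
+            unsigned int            nr_threads,
+            unsigned int            agcount)
+    {
+            struct workqueue        ag_wq;          /* one item per AG */
+            struct workqueue        bstat_wq;       /* bounded backlog */
+            unsigned int            agno;
+
+            /* Unbounded queue; it only ever holds one item per AG. */
+            workqueue_create(&ag_wq, ctx, nr_threads);
+
+            /*
+             * Bounded queue; workqueue_add blocks the INUMBERS workers
+             * whenever the BULKSTAT backlog reaches the maximum.
+             */
+            workqueue_create_bound(&bstat_wq, ctx, nr_threads,
+                            4 * nr_threads);
+
+            for (agno = 0; agno < agcount; agno++)
+                    workqueue_add(&ag_wq, inumbers_worker, agno,
+                                    &bstat_wq);
+
+            workqueue_terminate(&ag_wq);    /* drain INUMBERS walkers */
+            workqueue_terminate(&bstat_wq); /* drain BULKSTAT workers */
+            workqueue_destroy(&ag_wq);
+            workqueue_destroy(&bstat_wq);
+    }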
+
+The proposed patchsets are the scrub
+`performance tweaks
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
+and the
+`inode scan rebalance
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems reported in any other space metadata are deferred to phase 4.
+Optimization opportunities are always deferred to phase 4, no matter their
+origin.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata are repaired immediately if all space metadata were validated
+during phase 2.
+Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
+
+In the original design of ``xfs_scrub``, it was thought that repairs would be
+so infrequent that the ``struct xfs_scrub_metadata`` objects used to
+communicate with the kernel could also be used as the primary object to
+schedule repairs.
+With recent increases in the number of optimizations possible for a given
+filesystem object, it became much more memory-efficient to track all eligible
+repairs for a given filesystem object with a single repair item.
+Each repair item represents a single lockable object -- AGs, metadata files,
+individual inodes, or a class of summary information.
+
+Phase 4 is responsible for scheduling a lot of repair work in as quick a
+manner as is practical.
+The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
+means that ``xfs_scrub`` must try to complete the repair work scheduled by
+phase 2 before trying repair work scheduled by phase 3.
+The repair process is as follows:
+
+1. Start a round of repair with a workqueue and enough workers to keep the CPUs
+ as busy as the user desires.
+
+ a. For each repair item queued by phase 2,
+
+ i. Ask the kernel to repair everything listed in the repair item for a
+ given filesystem object.
+
+ ii. Make a note if the kernel made any progress in reducing the number
+ of repairs needed for this object.
+
+ iii. If the object no longer requires repairs, revalidate all metadata
+ associated with this object.
+ If the revalidation succeeds, drop the repair item.
+ If not, requeue the item for more repairs.
+
+ b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
+
+ c. For each repair item queued by phase 3,
+
+ i. Ask the kernel to repair everything listed in the repair item for a
+ given filesystem object.
+
+ ii. Make a note if the kernel made any progress in reducing the number
+ of repairs needed for this object.
+
+ iii. If the object no longer requires repairs, revalidate all metadata
+ associated with this object.
+ If the revalidation succeeds, drop the repair item.
+ If not, requeue the item for more repairs.
+
+ d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
+
+2. If step 1 made any repair progress of any kind, jump back to step 1 to start
+ another round of repair.
+
+3. If there are items left to repair, run them all serially one more time.
+ Complain if the repairs were not successful, since this is the last chance
+ to repair anything.
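+
+The outer retry loop might look like this sketch; all types and helpers are
+hypothetical, and ``repair_item_list`` returns true if any repairs were made
+while processing the given list.
+
+.. code-block:: c
+
+    /* Sketch of the phase 4 repair scheduling loop. */
+    static void
+    repair_everything(struct scrub_ctx *ctx)
+    {
+            bool            again;
+
+            do {
+                    again = false;
+
+                    /* Phase 2 items first; phase 3 depends on them. */
+                    while (repair_item_list(ctx, &ctx->phase2_items))
+                            again = true;
+
+                    while (repair_item_list(ctx, &ctx->phase3_items))
+                            again = true;
+            } while (again);
+
+            /* Last chance: run the leftovers serially and complain. */
+            repair_item_list_serially(ctx);
+    }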
+
+Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
+immediately.
+Corrupt file data blocks reported by phase 6 cannot be recovered by the
+filesystem.
+
+The proposed patchsets are the
+`repair warning improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
+refactoring of the
+`repair data dependency
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
+and
+`object tracking
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
+and the
+`repair scheduling
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
+improvement series.
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to phase 5, which checks for suspicious looking names in
+the filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name:
+
+- Slashes and null bytes are not allowed in directory entries.
+
+- Null bytes are not allowed in userspace-visible extended attributes.
+
+- Null bytes are not allowed in the filesystem label.
+
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by it.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
+document.
+When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu <https://github.com/unicode-org/icu>`_
+to identify names within a directory or within a file's extended attributes
+could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
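+
+The skeleton-based detection from sections 4 and 5 of TR39 reduces each name
+to a normalized "skeleton" and flags any two names whose skeletons match.
+A sketch using libicu's spoof checker, assuming UTF-8 names; a real scanner
+would hash the skeleton of every name in the naming domain rather than
+comparing pairs:
+
+.. code-block:: c
+
+    #include <string.h>
+    #include <unicode/uspoof.h>
+
+    /* Sketch: two names are confusable if their skeletons match. */
+    static int
+    names_are_confusable(const char *name1, const char *name2)
+    {
+            UErrorCode      err = U_ZERO_ERROR;
+            USpoofChecker   *sc;
+            char            skel1[1024], skel2[1024];
+            int             ret = 0;
+
+            sc = uspoof_open(&err);
+            if (U_FAILURE(err))
+                    return 0;
+
+            uspoof_getSkeletonUTF8(sc, 0, name1, -1, skel1,
+                            sizeof(skel1), &err);
+            uspoof_getSkeletonUTF8(sc, 0, name2, -1, skel2,
+                            sizeof(skel2), &err);
+            if (U_SUCCESS(err) && !strcmp(skel1, skel2))
+                    ret = 1;
+
+            uspoof_close(sc);
+            return ret;
+    }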
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan occurs after validation of all filesystem metadata (except for the
+summary counters) as phase 6.
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow down the failure to a specific region of the media, which is then
+recorded.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, parent pointers can be used to
+construct file paths from inode numbers for user-friendly reporting.
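+
+A sketch of the space map walk using ``FS_IOC_GETFSMAP`` follows.
+The ``verify_region`` callback, which would batch nearby extents and issue
+direct reads against the raw block device, is assumed.
+
+.. code-block:: c
+
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsmap.h>
+
+    void verify_region(__u64 phys, __u64 len);  /* assumed */
+
+    /* Sketch of finding file data extents to media-verify. */
+    static int
+    scan_fsmap(int fs_fd)
+    {
+            struct fsmap_head       *head;
+            struct fsmap            *rec;
+            unsigned int            i;
+
+            head = calloc(1, sizeof(*head) + 128 * sizeof(struct fsmap));
+            if (!head)
+                    return -1;
+            head->fmh_count = 128;
+            /* Zeroed low key starts at the beginning of the device. */
+            memset(&head->fmh_keys[1], 0xFF, sizeof(struct fsmap));
+
+            while (!ioctl(fs_fd, FS_IOC_GETFSMAP, head) &&
+                   head->fmh_entries > 0) {
+                    for (i = 0; i < head->fmh_entries; i++) {
+                            rec = &head->fmh_recs[i];
+                            /* Skip everything but file data extents. */
+                            if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
+                                                  FMR_OF_ATTR_FORK |
+                                                  FMR_OF_EXTENT_MAP))
+                                    continue;
+                            verify_region(rec->fmr_physical,
+                                          rec->fmr_length);
+                    }
+                    /* Continue the query after the last record. */
+                    fsmap_advance(head);
+            }
+            free(head);
+            return 0;
+    }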
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader of this document has followed the designs laid out
+in this document and now has some familiarity with how XFS performs online
+rebuilding of its metadata indices, and how filesystem users can interact with
+that functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Extent Swapping with Regular User Files
+```````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain
+the consistency of the fork mappings with the reverse mapping index was to
+develop an iterative mechanism that used deferred bmap and rmap operations to
+swap mappings one at a time.
+This mechanism is identical to steps 2-3 from the procedure above except for
+the new tracking items, because the atomic extent swap mechanism is an
+iteration of an existing mechanism and not something totally novel.
+For the narrow case of file defragmentation, the file contents must be
+identical, so the recovery guarantees are not much of a gain.
+
+Atomic extent swapping is much more flexible than the existing swapext
+implementations because it can guarantee that the caller never sees a mix of
+old and new contents even after a crash, and it can operate on two arbitrary
+file fork ranges.
+The extra flexibility enables several new use cases:
+
+- **Atomic commit of file writes**: A userspace process opens a file that it
+ wants to update.
+ Next, it opens a temporary file and calls the file clone operation to reflink
+ the first file's contents into the temporary file.
+ Writes to the original file should instead be written to the temporary file.
+  Finally, the process calls the atomic extent swap system call
+  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
+  of the updates to the original file, or none of them; a sketch of this flow
+  follows this list.
+
+.. _swapext_if_unchanged:
+
+- **Transactional file updates**: The same mechanism as above, but the caller
+ only wants the commit to occur if the original file's contents have not
+ changed.
+ To make this happen, the calling process snapshots the file modification and
+ change timestamps of the original file before reflinking its data to the
+ temporary file.
+ When the program is ready to commit the changes, it passes the timestamps
+ into the kernel as arguments to the atomic extent swap system call.
+ The kernel only commits the changes if the provided timestamps match the
+ original file.
+
+- **Emulation of atomic block device writes**: Export a block device with a
+ logical sector size matching the filesystem block size to force all writes
+ to be aligned to the filesystem block size.
+ Stage all writes to a temporary file, and when that is complete, call the
+ atomic extent swap system call with a flag to indicate that holes in the
+ temporary file should be ignored.
+ This emulates an atomic device write in software, and can support arbitrary
+ scattered writes.
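+
+The following sketch shows the first use case, atomic commit of file writes.
+``FIEXCHANGE_RANGE`` has not been merged, so the structure layout and the
+ioctl definition below are illustrative stand-ins for whatever the final
+interface would provide; only ``FICLONE`` is a real interface here::
+
+	#include <stdint.h>
+	#include <sys/ioctl.h>
+	#include <linux/fs.h>		/* FICLONE */
+
+	/* Illustrative stand-in for the proposed exchange-range structure. */
+	struct file_xchg_range {
+		int64_t		file1_fd;
+		uint64_t	file1_offset;
+		uint64_t	file2_offset;
+		uint64_t	length;
+		uint64_t	flags;	/* e.g. commit only if unchanged */
+	};
+	/* Made-up ioctl number for the sake of the example. */
+	#define FIEXCHANGE_RANGE _IOWR('X', 129, struct file_xchg_range)
+
+	/* Stage: reflink the original file's contents into the temp file.
+	 * The caller then writes its updates to tmpfd. */
+	static int start_update(int fd, int tmpfd)
+	{
+		return ioctl(tmpfd, FICLONE, fd);
+	}
+
+	/* Commit: atomically exchange the contents, all or nothing. */
+	static int commit_update(int fd, int tmpfd, uint64_t length)
+	{
+		struct file_xchg_range args = {
+			.file1_fd = tmpfd,
+			.length = length,
+		};
+
+		return ioctl(fd, FIEXCHANGE_RANGE, &args);
+	}
+
+For the :ref:`transactional variant <swapext_if_unchanged>`, the proposed
+structure would also carry the snapshotted timestamps, which are omitted from
+this sketch.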
+
+Vectorized Scrub
+----------------
+
+As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
+earlier was a catalyst for enabling a vectorized scrub system call.
+Since 2018, the cost of making a kernel call has increased considerably on some
+systems to mitigate the effects of speculative execution attacks.
+This incentivizes program authors to make as few system calls as possible to
+reduce the number of times an execution path crosses a security boundary.
+
+With vectorized scrub, userspace pushes to the kernel the identity of a
+filesystem object, a list of scrub types to run against that object, and a
+simple representation of the data dependencies between the selected scrub
+types.
+The kernel executes as much of the caller's plan as it can until it hits a
+dependency that cannot be satisfied due to a corruption, and tells userspace
+how much was accomplished.
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectored scrub system
+call to XFS.
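+
+Purely as an illustration, a vectored interface might look something like the
+structures below; these are guesses at the shape of the proposal, not a
+merged uapi::
+
+	#include <linux/types.h>
+
+	/* One scrub call: the type to run and its outcome. */
+	struct xfs_scrub_vec {
+		__u32	sv_type;	/* XFS_SCRUB_TYPE_* */
+		__u32	sv_flags;	/* in: control; out: results */
+		__s32	sv_ret;		/* out: error code for this vec */
+		__u32	sv_reserved;	/* must be zero */
+	};
+
+	/* The object to scrub, plus the caller's plan. */
+	struct xfs_scrub_vec_head {
+		__u64	svh_ino;	/* inode number, if applicable */
+		__u32	svh_gen;	/* inode generation */
+		__u32	svh_agno;	/* AG number, if applicable */
+		__u32	svh_flags;
+		__u32	svh_nr;		/* number of elements in svh_vecs */
+		struct xfs_scrub_vec	svh_vecs[];
+	};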
+
+The relevant patchsets are the
+`kernel vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
+and
+`userspace vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that the operation cannot be
+cancelled cleanly once they begin to touch ondisk metadata, at which point a
+QoS timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
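+
+Of the three pieces, only ``FS_IOC_GETFSMAP`` exists today.
+The sketch below walks the space mappings from userspace; ``walk_rmap`` is an
+illustrative helper, and a real clearspace would constrain the keys to the
+region being cleared rather than dumping every record::
+
+	#include <limits.h>
+	#include <linux/fsmap.h>
+	#include <stdio.h>
+	#include <stdlib.h>
+	#include <sys/ioctl.h>
+
+	#define NR_RECS	128
+
+	/* Print every space mapping in the filesystem backing fsfd. */
+	static void walk_rmap(int fsfd)
+	{
+		struct fsmap_head *head;
+		unsigned int i;
+
+		head = calloc(1, fsmap_sizeof(NR_RECS));
+		if (!head)
+			return;
+		head->fmh_count = NR_RECS;
+
+		/* Low key stays zeroed; high key covers everything. */
+		head->fmh_keys[1].fmr_device = UINT_MAX;
+		head->fmh_keys[1].fmr_flags = UINT_MAX;
+		head->fmh_keys[1].fmr_physical = ULLONG_MAX;
+		head->fmh_keys[1].fmr_owner = ULLONG_MAX;
+		head->fmh_keys[1].fmr_offset = ULLONG_MAX;
+
+		while (!ioctl(fsfd, FS_IOC_GETFSMAP, head) &&
+		       head->fmh_entries > 0) {
+			for (i = 0; i < head->fmh_entries; i++) {
+				struct fsmap *r = &head->fmh_recs[i];
+
+				printf("dev %u phys %llu len %llu own %llu\n",
+				       r->fmr_device,
+				       (unsigned long long)r->fmr_physical,
+				       (unsigned long long)r->fmr_length,
+				       (unsigned long long)r->fmr_owner);
+			}
+			if (head->fmh_entries < head->fmh_count)
+				break;		/* no more records */
+			/* Resume the query after the last record seen. */
+			fsmap_advance(head);
+		}
+		free(head);
+	}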
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on each data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+<swapext_if_unchanged>` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
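+
+A sketch of the freeze-and-remap step appears below.
+``FICLONERANGE`` and ``FIDEDUPERANGE`` are real ioctls; the helper functions
+and the assumption that the new copy has already been written to ``copy_fd``
+at the same offset are illustrative::
+
+	#include <linux/fs.h>
+	#include <stdlib.h>
+	#include <sys/ioctl.h>
+
+	/* Freeze an extent by sharing it with a dummy file via reflink. */
+	static int freeze_extent(int target_fd, int dummy_fd,
+				 __u64 off, __u64 len)
+	{
+		struct file_clone_range clone = {
+			.src_fd = target_fd,
+			.src_offset = off,
+			.src_length = len,
+			.dest_offset = off,
+		};
+
+		return ioctl(dummy_fd, FICLONERANGE, &clone);
+	}
+
+	/*
+	 * Remap the target's extent at a copy living outside the region
+	 * being cleared.  copy_fd must already contain identical data at
+	 * the same offset, or the kernel will decline to dedupe.
+	 */
+	static int remap_extent(int target_fd, int copy_fd,
+				__u64 off, __u64 len)
+	{
+		struct file_dedupe_range *req;
+		int ret;
+
+		req = calloc(1, sizeof(*req) +
+				sizeof(struct file_dedupe_range_info));
+		if (!req)
+			return -1;
+		req->src_offset = off;
+		req->src_length = len;
+		req->dest_count = 1;
+		req->info[0].dest_fd = target_fd;
+		req->info[0].dest_offset = off;
+
+		ret = ioctl(copy_fd, FIDEDUPERANGE, req);
+		if (!ret && req->info[0].status != FILE_DEDUPE_RANGE_SAME)
+			ret = -1;	/* contents changed underneath us */
+		free(req);
+		return ret;
+	}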
+
+Further optimizations could apply to the above algorithm.
+When clearing a piece of physical storage that has a high sharing factor, it
+is strongly desirable to retain this sharing factor.
+In fact, such extents should be moved first to maximize the sharing factor
+after the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
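+
+No such ioctl exists today; purely as an illustration, a refcount record
+patterned on the fsmap interface might look like this hypothetical
+structure::
+
+	#include <linux/types.h>
+
+	/* Illustrative guess only; not a merged uapi. */
+	struct fsrefcount {
+		__u64	fcr_physical;	/* start of extent */
+		__u64	fcr_length;	/* length of extent */
+		__u64	fcr_owners;	/* number of owners */
+	};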
+
+**Future Work Question**: How might the filesystem move inode chunks?
+
+*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
+that creates a new file with the old contents and then locklessly runs around
+the filesystem updating directory entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Future Work Question**: Can static keys be used to minimize the cost of
+supporting ``revoke()`` on XFS files?
+
+*Answer*: Yes.
+Until the first revocation, the bailout code need not be in the call path at
+all.
+
+The relevant patchsets are the
+`kernel freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
+and
+`userspace freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That evacuation of the space at the end of the filesystem is itself a use of
+free space defragmentation!
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
index 8db0121d0980..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.txt
+++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
@@ -1,8 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
+
+============================
XFS Self Describing Metadata
-----------------------------
+============================
Introduction
-------------
+============
The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
@@ -34,7 +38,7 @@ required for basic forensic analysis of the filesystem structure.
Self Describing Metadata
-------------------------
+========================
One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
@@ -142,7 +146,7 @@ modification occurred between the corruption being written and when it was
detected.
Runtime Validation
-------------------
+==================
Validation of self-describing metadata takes place at runtime in two places:
@@ -183,18 +187,18 @@ error occurs during this process, the buffer is again marked with a EFSCORRUPTED
error for the higher layers to catch.
Structures
-----------
+==========
-A typical on-disk structure needs to contain the following information:
+A typical on-disk structure needs to contain the following information::
-struct xfs_ondisk_hdr {
- __be32 magic; /* magic number */
- __be32 crc; /* CRC, not logged */
- uuid_t uuid; /* filesystem identifier */
- __be64 owner; /* parent object */
- __be64 blkno; /* location on disk */
- __be64 lsn; /* last modification in log, not logged */
-};
+ struct xfs_ondisk_hdr {
+ __be32 magic; /* magic number */
+ __be32 crc; /* CRC, not logged */
+ uuid_t uuid; /* filesystem identifier */
+ __be64 owner; /* parent object */
+ __be64 blkno; /* location on disk */
+ __be64 lsn; /* last modification in log, not logged */
+ };
Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
@@ -214,24 +218,24 @@ level of information is generally provided. For example:
well. hence the additional metadata headers change the overall format
of the metadata.
-A typical buffer read verifier is structured as follows:
+A typical buffer read verifier is structured as follows::
-#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
+ #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
-static void
-xfs_foo_read_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
+ static void
+ xfs_foo_read_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
- if ((xfs_sb_version_hascrc(&mp->m_sb) &&
- !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
- XFS_FOO_CRC_OFF)) ||
- !xfs_foo_verify(bp)) {
- XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
- xfs_buf_ioerror(bp, EFSCORRUPTED);
- }
-}
+ if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+ !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+ XFS_FOO_CRC_OFF)) ||
+ !xfs_foo_verify(bp)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+ xfs_buf_ioerror(bp, EFSCORRUPTED);
+ }
+ }
The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
@@ -239,83 +243,83 @@ by checking the superblock of the feature bit, and then if the CRC verifies OK
The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
-the case it can't, the code is structured as follows:
+the case it can't, the code is structured as follows::
-static bool
-xfs_foo_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
+ static bool
+ xfs_foo_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
- if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
- return false;
+ if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+ return false;
- if (!xfs_sb_version_hascrc(&mp->m_sb)) {
- if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
- return false;
- if (bp->b_bn != be64_to_cpu(hdr->blkno))
- return false;
- if (hdr->owner == 0)
- return false;
- }
+ if (!xfs_sb_version_hascrc(&mp->m_sb)) {
+ if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+ return false;
+ if (bp->b_bn != be64_to_cpu(hdr->blkno))
+ return false;
+ if (hdr->owner == 0)
+ return false;
+ }
- /* object specific verification checks here */
+ /* object specific verification checks here */
- return true;
-}
+ return true;
+ }
If there are different magic numbers for the different formats, the verifier
-will look like:
-
-static bool
-xfs_foo_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
-
- if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
- if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
- return false;
- if (bp->b_bn != be64_to_cpu(hdr->blkno))
- return false;
- if (hdr->owner == 0)
- return false;
- } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
- return false;
-
- /* object specific verification checks here */
-
- return true;
-}
+will look like::
+
+ static bool
+ xfs_foo_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
+
+ if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+ if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+ return false;
+ if (bp->b_bn != be64_to_cpu(hdr->blkno))
+ return false;
+ if (hdr->owner == 0)
+ return false;
+ } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+ return false;
+
+ /* object specific verification checks here */
+
+ return true;
+ }
Write verifiers are very similar to the read verifiers, they just do things in
-the opposite order to the read verifiers. A typical write verifier:
+the opposite order to the read verifiers. A typical write verifier::
-static void
-xfs_foo_write_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_buf_log_item *bip = bp->b_fspriv;
+ static void
+ xfs_foo_write_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_buf_log_item *bip = bp->b_fspriv;
- if (!xfs_foo_verify(bp)) {
- XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
- xfs_buf_ioerror(bp, EFSCORRUPTED);
- return;
- }
+ if (!xfs_foo_verify(bp)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+ xfs_buf_ioerror(bp, EFSCORRUPTED);
+ return;
+ }
- if (!xfs_sb_version_hascrc(&mp->m_sb))
- return;
+ if (!xfs_sb_version_hascrc(&mp->m_sb))
+ return;
- if (bip) {
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
- hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
- }
- xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
-}
+ if (bip) {
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
+ hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+ }
+ xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+ }
This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
@@ -324,7 +328,7 @@ update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.
Inodes and Dquots
------------------
+=================
Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
@@ -337,14 +341,13 @@ buffer.
The structure of the verifiers and the identifiers checks is very similar to the
buffer code described above. The only difference is where they are called. For
-example, inode read verification is done in xfs_iread() when the inode is first
-read out of the buffer and the struct xfs_inode is instantiated. The inode is
-already extensively verified during writeback in xfs_iflush_int, so the only
-addition here is to add the LSN and CRC to the inode as it is copied back into
-the buffer.
+example, inode read verification is done in xfs_inode_from_disk() when the inode
+is first read out of the buffer and the struct xfs_inode is instantiated. The
+inode is already extensively verified during writeback in xfs_iflush_int, so the
+only addition here is to add the LSN and CRC to the inode as it is copied back
+into the buffer.
XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.
-
diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst
index 71d845c6a700..c22124c2213d 100644
--- a/Documentation/filesystems/zonefs.rst
+++ b/Documentation/filesystems/zonefs.rst
@@ -110,14 +110,14 @@ contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.
All read and write operations to zone files are not allowed beyond the file
-maximum size, that is, beyond the zone size. Any access exceeding the zone
-size is failed with the -EFBIG error.
+maximum size, that is, beyond the zone capacity. Any access exceeding the zone
+capacity is failed with the -EFBIG error.
Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.
The number of blocks of a file as reported by stat() and fstat() indicates the
-size of the file zone, or in other words, the maximum file size.
+capacity of the zone file, or in other words, the maximum file size.
Conventional zone files
-----------------------
@@ -156,8 +156,8 @@ all accepted.
Truncating sequential zone files is allowed only down to 0, in which case, the
zone is reset to rewind the file zone write pointer position to the start of
-the zone, or up to the zone size, in which case the file's zone is transitioned
-to the FULL state (finish zone operation).
+the zone, or up to the zone capacity, in which case the file's zone is
+transitioned to the FULL state (finish zone operation).
Format options
--------------
@@ -306,8 +306,15 @@ Further notes:
Mount options
-------------
-zonefs define the "errors=<behavior>" mount option to allow the user to specify
-zonefs behavior in response to I/O errors, inode size inconsistencies or zone
+zonefs defines several mount options:
+
+* errors=<behavior>
+* explicit-open
+
+"errors=<behavior>" option
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The "errors=<behavior>" option mount option allows the user to specify zonefs
+behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:
* remount-ro (default)
@@ -324,7 +331,63 @@ file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section.
-the size of the zone file is left unchanged from its last updated value.
+The size of the zone file is left unchanged from its last updated value.
+
+"explicit-open" option
+~~~~~~~~~~~~~~~~~~~~~~
+
+A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
+the number of zones that can be active, that is, zones that are in the
+implicit open, explicit open or closed conditions. This limitation translates
+into a risk of write IO errors for applications: if the zone of a file is not
+already active when a write request is issued, activating it may exceed the
+device limit and cause the write to fail.
+
+To avoid these potential errors, the "explicit-open" mount option forces zones
+to be made active using an open zone command when a file is opened for writing
+for the first time. If the zone open command succeeds, the application is then
+guaranteed that write requests can be processed. Conversely, the
+"explicit-open" mount option will result in a zone close command being issued
+to the device on the last close() of a zone file if the zone is not full nor
+empty.
+
+Runtime sysfs attributes
+------------------------
+
+zonefs defines several sysfs attributes for mounted devices. All attributes
+are user readable and can be found in the directory /sys/fs/zonefs/<dev>/,
+where <dev> is the name of the mounted zoned block device.
+
+The attributes defined are as follows.
+
+* **max_wro_seq_files**: This attribute reports the maximum number of
+ sequential zone files that can be open for writing. This number corresponds
+ to the maximum number of explicitly or implicitly open zones that the device
+ supports. A value of 0 means that the device has no limit and that any zone
+ (any file) can be open for writing and written at any time, regardless of the
+ state of other zones. When the *explicit-open* mount option is used, zonefs
+ will fail any open() system call requesting to open a sequential zone file for
+ writing when the number of sequential zone files already open for writing has
+ reached the *max_wro_seq_files* limit.
+* **nr_wro_seq_files**: This attribute reports the current number of sequential
+  zone files open for writing. When the "explicit-open" mount option is used,
+  this number can never exceed *max_wro_seq_files*. If the *explicit-open*
+  mount option is not used, the reported number can be greater than
+  *max_wro_seq_files*. In that case, it is the responsibility of the
+  application not to simultaneously write to more than *max_wro_seq_files*
+  sequential zone files; failure to do so can result in write errors (see the
+  example after this list).
+* **max_active_seq_files**: This attribute reports the maximum number of
+  sequential zone files that are in an active state, that is, sequential zone
+  files that are partially written (neither empty nor full) or that have a
+  zone that is explicitly open (which happens only if the *explicit-open*
+  mount option is used). This number is always equal to the maximum number of
+  active zones that the device supports. A value of 0 means that the mounted
+  device has no limit on the number of sequential zone files that can be
+  active.
+* **nr_active_seq_files**: This attribute reports the current number of
+  sequential zone files that are active. If *max_active_seq_files* is not 0,
+  then the value of *nr_active_seq_files* can never exceed the value of
+  *max_active_seq_files*, regardless of the use of the *explicit-open* mount
+  option.
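+
+As an example, an application can compare the first two attributes before
+opening another sequential zone file for writing.
+The sketch below assumes the mounted device is ``nvme0n2``; the helpers are
+illustrative only::
+
+	#include <stdio.h>
+
+	static long zonefs_attr(const char *name)
+	{
+		char path[256];
+		long val = -1;
+		FILE *f;
+
+		snprintf(path, sizeof(path),
+			 "/sys/fs/zonefs/nvme0n2/%s", name);
+		f = fopen(path, "r");
+		if (!f)
+			return -1;
+		if (fscanf(f, "%ld", &val) != 1)
+			val = -1;
+		fclose(f);
+		return val;
+	}
+
+	/* Returns nonzero if another zone file may be opened for writing. */
+	static int can_open_for_write(void)
+	{
+		long max = zonefs_attr("max_wro_seq_files");
+		long nr = zonefs_attr("nr_wro_seq_files");
+
+		return max <= 0 || nr < max;
+	}
+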
Zonefs User Space Tools
=======================
@@ -401,8 +464,9 @@ append-writes to the file::
# ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
-Since files are statically mapped to zones on the disk, the number of blocks of
-a file as reported by stat() and fstat() indicates the size of the file zone::
+Since files are statically mapped to zones on the disk, the number of blocks
+of a file as reported by stat() and fstat() indicates the capacity of the file
+zone::
# stat /mnt/seq/0
File: /mnt/seq/0
@@ -416,5 +480,6 @@ a file as reported by stat() and fstat() indicates the size of the file zone::
The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
-size in this example. Of note is that the "IO block" field always indicates the
-minimum I/O size for writes and corresponds to the device physical sector size.
+capacity in this example. Of note is that the "IO block" field always
+indicates the minimum I/O size for writes and corresponds to the device
+physical sector size.