aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/00-INDEX153
-rw-r--r--Documentation/filesystems/caching/backend-api.txt2
-rw-r--r--Documentation/filesystems/caching/cachefiles.txt4
-rw-r--r--Documentation/filesystems/caching/netfs-api.txt2
-rw-r--r--Documentation/filesystems/caching/operations.txt2
-rw-r--r--Documentation/filesystems/ceph.txt5
-rw-r--r--Documentation/filesystems/cifs/TODO26
-rw-r--r--Documentation/filesystems/configfs/configfs.txt2
-rw-r--r--Documentation/filesystems/dax.txt2
-rw-r--r--Documentation/filesystems/ext2.txt2
-rw-r--r--Documentation/filesystems/ext4/about.rst (renamed from Documentation/filesystems/ext4/ondisk/about.rst)0
-rw-r--r--Documentation/filesystems/ext4/allocators.rst (renamed from Documentation/filesystems/ext4/ondisk/allocators.rst)0
-rw-r--r--Documentation/filesystems/ext4/attributes.rst (renamed from Documentation/filesystems/ext4/ondisk/attributes.rst)8
-rw-r--r--Documentation/filesystems/ext4/bigalloc.rst (renamed from Documentation/filesystems/ext4/ondisk/bigalloc.rst)0
-rw-r--r--Documentation/filesystems/ext4/bitmaps.rst (renamed from Documentation/filesystems/ext4/ondisk/bitmaps.rst)0
-rw-r--r--Documentation/filesystems/ext4/blockgroup.rst (renamed from Documentation/filesystems/ext4/ondisk/blockgroup.rst)0
-rw-r--r--Documentation/filesystems/ext4/blockmap.rst (renamed from Documentation/filesystems/ext4/ondisk/blockmap.rst)0
-rw-r--r--Documentation/filesystems/ext4/blocks.rst (renamed from Documentation/filesystems/ext4/ondisk/blocks.rst)0
-rw-r--r--Documentation/filesystems/ext4/checksums.rst (renamed from Documentation/filesystems/ext4/ondisk/checksums.rst)2
-rw-r--r--Documentation/filesystems/ext4/directory.rst (renamed from Documentation/filesystems/ext4/ondisk/directory.rst)18
-rw-r--r--Documentation/filesystems/ext4/dynamic.rst (renamed from Documentation/filesystems/ext4/ondisk/dynamic.rst)0
-rw-r--r--Documentation/filesystems/ext4/eainode.rst (renamed from Documentation/filesystems/ext4/ondisk/eainode.rst)0
-rw-r--r--Documentation/filesystems/ext4/ext4.rst613
-rw-r--r--Documentation/filesystems/ext4/globals.rst (renamed from Documentation/filesystems/ext4/ondisk/globals.rst)0
-rw-r--r--Documentation/filesystems/ext4/group_descr.rst (renamed from Documentation/filesystems/ext4/ondisk/group_descr.rst)4
-rw-r--r--Documentation/filesystems/ext4/ifork.rst (renamed from Documentation/filesystems/ext4/ondisk/ifork.rst)8
-rw-r--r--Documentation/filesystems/ext4/index.rst19
-rw-r--r--Documentation/filesystems/ext4/inlinedata.rst (renamed from Documentation/filesystems/ext4/ondisk/inlinedata.rst)0
-rw-r--r--Documentation/filesystems/ext4/inodes.rst (renamed from Documentation/filesystems/ext4/ondisk/inodes.rst)19
-rw-r--r--Documentation/filesystems/ext4/journal.rst (renamed from Documentation/filesystems/ext4/ondisk/journal.rst)32
-rw-r--r--Documentation/filesystems/ext4/mmp.rst (renamed from Documentation/filesystems/ext4/ondisk/mmp.rst)2
-rw-r--r--Documentation/filesystems/ext4/ondisk/index.rst9
-rw-r--r--Documentation/filesystems/ext4/overview.rst (renamed from Documentation/filesystems/ext4/ondisk/overview.rst)0
-rw-r--r--Documentation/filesystems/ext4/special_inodes.rst (renamed from Documentation/filesystems/ext4/ondisk/special_inodes.rst)2
-rw-r--r--Documentation/filesystems/ext4/super.rst (renamed from Documentation/filesystems/ext4/ondisk/super.rst)24
-rw-r--r--Documentation/filesystems/f2fs.txt8
-rw-r--r--Documentation/filesystems/fscrypt.rst179
-rw-r--r--Documentation/filesystems/index.rst21
-rw-r--r--Documentation/filesystems/nfs/00-INDEX26
-rw-r--r--Documentation/filesystems/nfs/rpc-cache.txt6
-rw-r--r--Documentation/filesystems/path-lookup.rst (renamed from Documentation/filesystems/path-lookup.md)913
-rw-r--r--Documentation/filesystems/pohmelfs/design_notes.txt72
-rw-r--r--Documentation/filesystems/pohmelfs/info.txt99
-rw-r--r--Documentation/filesystems/pohmelfs/network_protocol.txt227
-rw-r--r--Documentation/filesystems/porting16
-rw-r--r--Documentation/filesystems/proc.txt27
-rw-r--r--Documentation/filesystems/qnx6.txt4
-rw-r--r--Documentation/filesystems/spufs.txt2
-rw-r--r--Documentation/filesystems/sysfs.txt4
-rw-r--r--Documentation/filesystems/ubifs-authentication.md426
-rw-r--r--Documentation/filesystems/ubifs.txt7
-rw-r--r--Documentation/filesystems/vfs.txt24
-rw-r--r--Documentation/filesystems/xfs-self-describing-metadata.txt2
-rw-r--r--Documentation/filesystems/xfs.txt2
54 files changed, 1213 insertions, 1812 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
deleted file mode 100644
index 0937bade1099..000000000000
--- a/Documentation/filesystems/00-INDEX
+++ /dev/null
@@ -1,153 +0,0 @@
-00-INDEX
- - this file (info on some of the filesystems supported by linux).
-Locking
- - info on locking rules as they pertain to Linux VFS.
-9p.txt
- - 9p (v9fs) is an implementation of the Plan 9 remote fs protocol.
-adfs.txt
- - info and mount options for the Acorn Advanced Disc Filing System.
-afs.txt
- - info and examples for the distributed AFS (Andrew File System) fs.
-affs.txt
- - info and mount options for the Amiga Fast File System.
-autofs-mount-control.txt
- - info on device control operations for autofs module.
-automount-support.txt
- - information about filesystem automount support.
-befs.txt
- - information about the BeOS filesystem for Linux.
-bfs.txt
- - info for the SCO UnixWare Boot Filesystem (BFS).
-btrfs.txt
- - info for the BTRFS filesystem.
-caching/
- - directory containing filesystem cache documentation.
-ceph.txt
- - info for the Ceph Distributed File System.
-cifs/
- - directory containing CIFS filesystem documentation and example code.
-coda.txt
- - description of the CODA filesystem.
-configfs/
- - directory containing configfs documentation and example code.
-cramfs.txt
- - info on the cram filesystem for small storage (ROMs etc).
-dax.txt
- - info on avoiding the page cache for files stored on CPU-addressable
- storage devices.
-debugfs.txt
- - info on the debugfs filesystem.
-devpts.txt
- - info on the devpts filesystem.
-directory-locking
- - info about the locking scheme used for directory operations.
-dlmfs.txt
- - info on the userspace interface to the OCFS2 DLM.
-dnotify.txt
- - info about directory notification in Linux.
-dnotify_test.c
- - example program for dnotify.
-ecryptfs.txt
- - docs on eCryptfs: stacked cryptographic filesystem for Linux.
-efivarfs.txt
- - info for the efivarfs filesystem.
-exofs.txt
- - info, usage, mount options, design about EXOFS.
-ext2.txt
- - info, mount options and specifications for the Ext2 filesystem.
-ext3.txt
- - info, mount options and specifications for the Ext3 filesystem.
-ext4.txt
- - info, mount options and specifications for the Ext4 filesystem.
-f2fs.txt
- - info and mount options for the F2FS filesystem.
-fiemap.txt
- - info on fiemap ioctl.
-files.txt
- - info on file management in the Linux kernel.
-fuse.txt
- - info on the Filesystem in User SpacE including mount options.
-gfs2-glocks.txt
- - info on the Global File System 2 - Glock internal locking rules.
-gfs2-uevents.txt
- - info on the Global File System 2 - uevents.
-gfs2.txt
- - info on the Global File System 2.
-hfs.txt
- - info on the Macintosh HFS Filesystem for Linux.
-hfsplus.txt
- - info on the Macintosh HFSPlus Filesystem for Linux.
-hpfs.txt
- - info and mount options for the OS/2 HPFS.
-inotify.txt
- - info on the powerful yet simple file change notification system.
-isofs.txt
- - info and mount options for the ISO 9660 (CDROM) filesystem.
-jfs.txt
- - info and mount options for the JFS filesystem.
-locks.txt
- - info on file locking implementations, flock() vs. fcntl(), etc.
-mandatory-locking.txt
- - info on the Linux implementation of Sys V mandatory file locking.
-nfs/
- - nfs-related documentation.
-nilfs2.txt
- - info and mount options for the NILFS2 filesystem.
-ntfs.txt
- - info and mount options for the NTFS filesystem (Windows NT).
-ocfs2.txt
- - info and mount options for the OCFS2 clustered filesystem.
-omfs.txt
- - info on the Optimized MPEG FileSystem.
-path-lookup.txt
- - info on path walking and name lookup locking.
-pohmelfs/
- - directory containing pohmelfs filesystem documentation.
-porting
- - various information on filesystem porting.
-proc.txt
- - info on Linux's /proc filesystem.
-qnx6.txt
- - info on the QNX6 filesystem.
-quota.txt
- - info on Quota subsystem.
-ramfs-rootfs-initramfs.txt
- - info on the 'in memory' filesystems ramfs, rootfs and initramfs.
-relay.txt
- - info on relay, for efficient streaming from kernel to user space.
-romfs.txt
- - description of the ROMFS filesystem.
-seq_file.txt
- - how to use the seq_file API.
-sharedsubtree.txt
- - a description of shared subtrees for namespaces.
-spufs.txt
- - info and mount options for the SPU filesystem used on Cell.
-squashfs.txt
- - info on the squashfs filesystem.
-sysfs-pci.txt
- - info on accessing PCI device resources through sysfs.
-sysfs-tagging.txt
- - info on sysfs tagging to avoid duplicates.
-sysfs.txt
- - info on sysfs, a ram-based filesystem for exporting kernel objects.
-sysv-fs.txt
- - info on the SystemV/V7/Xenix/Coherent filesystem.
-tmpfs.txt
- - info on tmpfs, a filesystem that holds all files in virtual memory.
-ubifs.txt
- - info on the Unsorted Block Images FileSystem.
-udf.txt
- - info and mount options for the UDF filesystem.
-ufs.txt
- - info on the ufs filesystem.
-vfat.txt
- - info on using the VFAT filesystem used in Windows NT and Windows 95.
-vfs.txt
- - overview of the Virtual File System.
-xfs-delayed-logging-design.txt
- - info on the XFS Delayed Logging Design.
-xfs-self-describing-metadata.txt
- - info on XFS Self Describing Metadata.
-xfs.txt
- - info and mount options for the XFS filesystem.
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
index c0bd5677271b..c418280c915f 100644
--- a/Documentation/filesystems/caching/backend-api.txt
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -704,7 +704,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
void fscache_get_retrieval(struct fscache_retrieval *op);
void fscache_put_retrieval(struct fscache_retrieval *op);
- These two functions are used to retain a retrieval record whilst doing
+ These two functions are used to retain a retrieval record while doing
asynchronous data retrieval and block allocation.
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
index 748a1ae49e12..28aefcbb1442 100644
--- a/Documentation/filesystems/caching/cachefiles.txt
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -45,7 +45,7 @@ filesystems are very specific in nature.
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communication with the daemon. Only one thing may have this open at once,
-and whilst it is open, a cache is at least partially in existence. The daemon
+and while it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.
CacheFiles is currently limited to a single cache.
@@ -163,7 +163,7 @@ Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.
-Do not create, rename or unlink files and directories in the cache whilst the
+Do not create, rename or unlink files and directories in the cache while the
cache is active, as this may cause the state to become uncertain.
Renaming files in the cache might make objects appear to be other objects (the
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 2a6f7399c1f3..ba968e8f5704 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -382,7 +382,7 @@ MISCELLANEOUS OBJECT REGISTRATION
An optional step is to request an object of miscellaneous type be created in
the cache. This is almost identical to index cookie acquisition. The only
difference is that the type in the object definition should be something other
-than index type. Whilst the parent object could be an index, it's more likely
+than index type. While the parent object could be an index, it's more likely
it would be some other type of object such as a data file.
xattr->cache =
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
index a1c052cbba35..d8976c434718 100644
--- a/Documentation/filesystems/caching/operations.txt
+++ b/Documentation/filesystems/caching/operations.txt
@@ -171,7 +171,7 @@ Operations are used through the following procedure:
(3) If the submitting thread wants to do the work itself, and has marked the
operation with FSCACHE_OP_MYTHREAD, then it should monitor
FSCACHE_OP_WAITING as described above and check the state of the object if
- necessary (the object might have died whilst the thread was waiting).
+ necessary (the object might have died while the thread was waiting).
When it has finished doing its processing, it should call
fscache_op_complete() and fscache_put_operation() on it.
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
index 8bf62240e10d..1177052701e1 100644
--- a/Documentation/filesystems/ceph.txt
+++ b/Documentation/filesystems/ceph.txt
@@ -151,6 +151,11 @@ Mount Options
Report overall filesystem usage in statfs instead of using the root
directory quota.
+ nocopyfrom
+ Don't use the RADOS 'copy-from' operation to perform remote object
+ copies. Currently, it's only used in copy_file_range, which will revert
+ to the default VFS implementation if this option is used.
+
More Information
================
diff --git a/Documentation/filesystems/cifs/TODO b/Documentation/filesystems/cifs/TODO
index 852499aed64b..66b3f54aa6dc 100644
--- a/Documentation/filesystems/cifs/TODO
+++ b/Documentation/filesystems/cifs/TODO
@@ -1,4 +1,4 @@
-Version 2.11 September 13, 2017
+Version 2.14 December 21, 2018
A Partial List of Missing Features
==================================
@@ -7,7 +7,7 @@ Contributions are welcome. There are plenty of opportunities
for visible, important contributions to this module. Here
is a partial list of the known problems and missing features:
-a) SMB3 (and SMB3.02) missing optional features:
+a) SMB3 (and SMB3.1.1) missing optional features:
- multichannel (started), integration with RDMA
- directory leases (improved metadata caching), started (root dir only)
- T10 copy offload ie "ODX" (copy chunk, and "Duplicate Extents" ioctl
@@ -21,8 +21,9 @@ using Directory Leases, currently only the root file handle is cached longer
d) quota support (needs minor kernel change since quota calls
to make it to network filesystems or deviceless filesystems)
-e) Compounding (in progress) to reduce number of roundtrips, and also
-better optimize open to reduce redundant opens (using reference counts more).
+e) Additional use cases where we use "compoounding" (e.g. open/query/close
+and open/setinfo/close) to reduce the number of roundtrips, and also
+open to reduce redundant opens (using deferred close and reference counts more).
f) Finish inotify support so kde and gnome file list windows
will autorefresh (partially complete by Asser). Needs minor kernel
@@ -43,11 +44,13 @@ exists. Also better integration with winbind for resolving SID owners
k) Add tools to take advantage of more smb3 specific ioctls and features
(passthrough ioctl/fsctl for sending various SMB3 fsctls to the server
-is in progress)
+is in progress, and a passthrough query_info call is already implemented
+in cifs.ko to allow smb3 info levels queries to be sent from userspace)
l) encrypted file support
-m) improved stats gathering, tools (perhaps integration with nfsometer?)
+m) improved stats gathering tools (perhaps integration with nfsometer?)
+to extend and make easier to use what is currently in /proc/fs/cifs/Stats
n) allow setting more NTFS/SMB3 file attributes remotely (currently limited to compressed
file attribute via chflags) and improve user space tools for managing and
@@ -76,6 +79,9 @@ and simplify the code.
v) POSIX Extensions for SMB3.1.1 (started, create and mkdir support added
so far).
+w) Add support for additional strong encryption types, and additional spnego
+authentication mechanisms (see MS-SMB2)
+
KNOWN BUGS
====================================
See http://bugzilla.samba.org - search on product "CifsVFS" for
@@ -102,3 +108,11 @@ and when signing is disabled to request larger read sizes (larger than
negotiated size) and send larger write sizes to modern servers.
4) More exhaustively test against less common servers
+
+5) Continue to extend the smb3 "buildbot" which does automated xfstesting
+against Windows, Samba and Azure currently - to add additional tests and
+to allow the buildbot to execute the tests faster.
+
+6) Address various coverity warnings (most are not bugs per-se, but
+the more warnings are addressed, the easier it is to spot real
+problems that static analyzers will point out in the future).
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt
index 3828e85345ae..16e606c11f40 100644
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -216,7 +216,7 @@ be called whenever userspace asks for a write(2) on the attribute.
[struct configfs_bin_attribute]
- struct configfs_attribute {
+ struct configfs_bin_attribute {
struct configfs_attribute cb_attr;
void *cb_private;
size_t cb_max_size;
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 70cb68bed2e8..6d2c0d340dea 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -75,7 +75,7 @@ exposure of uninitialized data through mmap.
These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.txt
-- ext4: see Documentation/filesystems/ext4.txt
+- ext4: see Documentation/filesystems/ext4/
- xfs: see Documentation/filesystems/xfs.txt
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 81c0becab225..a19973a4dd1e 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -358,7 +358,7 @@ and are copied into the filesystem. If a transaction is incomplete at
the time of the crash, then there is no guarantee of consistency for
the blocks in that transaction so they are discarded (which means any
filesystem changes they represent are also lost).
-Check Documentation/filesystems/ext4.txt if you want to read more about
+Check Documentation/filesystems/ext4/ if you want to read more about
ext4 and journaling.
References
diff --git a/Documentation/filesystems/ext4/ondisk/about.rst b/Documentation/filesystems/ext4/about.rst
index 0aadba052264..0aadba052264 100644
--- a/Documentation/filesystems/ext4/ondisk/about.rst
+++ b/Documentation/filesystems/ext4/about.rst
diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/allocators.rst
index 7aa85152ace3..7aa85152ace3 100644
--- a/Documentation/filesystems/ext4/ondisk/allocators.rst
+++ b/Documentation/filesystems/ext4/allocators.rst
diff --git a/Documentation/filesystems/ext4/ondisk/attributes.rst b/Documentation/filesystems/ext4/attributes.rst
index 0b01b67b81fe..54386a010a8d 100644
--- a/Documentation/filesystems/ext4/ondisk/attributes.rst
+++ b/Documentation/filesystems/ext4/attributes.rst
@@ -30,7 +30,7 @@ Extended attributes, when stored after the inode, have a header
``ext4_xattr_ibody_header`` that is 4 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -47,7 +47,7 @@ The beginning of an extended attribute block is in
``struct ext4_xattr_header``, which is 32 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -92,7 +92,7 @@ entries must be stored in sorted order. The sort order is
Attributes stored inside an inode do not need be stored in sorted order.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -157,7 +157,7 @@ attribute name index field is set, and matching string is removed from
the key name. Here is a map of name index values to key prefixes:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Name Index
diff --git a/Documentation/filesystems/ext4/ondisk/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst
index c6d88557553c..c6d88557553c 100644
--- a/Documentation/filesystems/ext4/ondisk/bigalloc.rst
+++ b/Documentation/filesystems/ext4/bigalloc.rst
diff --git a/Documentation/filesystems/ext4/ondisk/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst
index c7546dbc197a..c7546dbc197a 100644
--- a/Documentation/filesystems/ext4/ondisk/bitmaps.rst
+++ b/Documentation/filesystems/ext4/bitmaps.rst
diff --git a/Documentation/filesystems/ext4/ondisk/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index baf888e4c06a..baf888e4c06a 100644
--- a/Documentation/filesystems/ext4/ondisk/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
diff --git a/Documentation/filesystems/ext4/ondisk/blockmap.rst b/Documentation/filesystems/ext4/blockmap.rst
index 30e25750d88a..30e25750d88a 100644
--- a/Documentation/filesystems/ext4/ondisk/blockmap.rst
+++ b/Documentation/filesystems/ext4/blockmap.rst
diff --git a/Documentation/filesystems/ext4/ondisk/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
index 73d4dc0f7bda..73d4dc0f7bda 100644
--- a/Documentation/filesystems/ext4/ondisk/blocks.rst
+++ b/Documentation/filesystems/ext4/blocks.rst
diff --git a/Documentation/filesystems/ext4/ondisk/checksums.rst b/Documentation/filesystems/ext4/checksums.rst
index 9d6a793b2e03..5519e253810d 100644
--- a/Documentation/filesystems/ext4/ondisk/checksums.rst
+++ b/Documentation/filesystems/ext4/checksums.rst
@@ -28,7 +28,7 @@ of checksum. The checksum function is whatever the superblock describes
(crc32c as of October 2013) unless noted otherwise.
.. list-table::
- :widths: 1 1 4
+ :widths: 20 8 50
:header-rows: 1
* - Metadata
diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/directory.rst
index 8fcba68c2884..614034e24669 100644
--- a/Documentation/filesystems/ext4/ondisk/directory.rst
+++ b/Documentation/filesystems/ext4/directory.rst
@@ -34,7 +34,7 @@ is at most 263 bytes long, though on disk you'll need to reference
``dirent.rec_len`` to know for sure.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -66,7 +66,7 @@ tree traversal. This format is ``ext4_dir_entry_2``, which is at most
``dirent.rec_len`` to know for sure.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -99,7 +99,7 @@ tree traversal. This format is ``ext4_dir_entry_2``, which is at most
The directory file type is one of the following values:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -130,7 +130,7 @@ in the place where the name normally goes. The structure is
``struct ext4_dir_entry_tail``:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -212,7 +212,7 @@ The root of the htree is in ``struct dx_root``, which is the full length
of a data block:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -305,7 +305,7 @@ of a data block:
The directory hash is one of the following values:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -327,7 +327,7 @@ Interior nodes of an htree are recorded as ``struct dx_node``, which is
also the full length of a data block:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -375,7 +375,7 @@ The hash maps that exist in both ``struct dx_root`` and
long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -405,7 +405,7 @@ directory index (which will ensure that there's space for the checksum.
The dx\_tail structure is 8 bytes long and looks like this:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/dynamic.rst
index bb0c84333341..bb0c84333341 100644
--- a/Documentation/filesystems/ext4/ondisk/dynamic.rst
+++ b/Documentation/filesystems/ext4/dynamic.rst
diff --git a/Documentation/filesystems/ext4/ondisk/eainode.rst b/Documentation/filesystems/ext4/eainode.rst
index ecc0d01a0a72..ecc0d01a0a72 100644
--- a/Documentation/filesystems/ext4/ondisk/eainode.rst
+++ b/Documentation/filesystems/ext4/eainode.rst
diff --git a/Documentation/filesystems/ext4/ext4.rst b/Documentation/filesystems/ext4/ext4.rst
deleted file mode 100644
index 9d4368d591fa..000000000000
--- a/Documentation/filesystems/ext4/ext4.rst
+++ /dev/null
@@ -1,613 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-========================
-General Information
-========================
-
-Ext4 is an advanced level of the ext3 filesystem which incorporates
-scalability and reliability enhancements for supporting large filesystems
-(64 bit) in keeping with increasing disk capacities and state-of-the-art
-feature requirements.
-
-Mailing list: linux-ext4@vger.kernel.org
-Web site: http://ext4.wiki.kernel.org
-
-
-Quick usage instructions
-========================
-
-Note: More extensive information for getting started with ext4 can be
-found at the ext4 wiki site at the URL:
-http://ext4.wiki.kernel.org/index.php/Ext4_Howto
-
- - The latest version of e2fsprogs can be found at:
-
- https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
-
- or
-
- http://sourceforge.net/project/showfiles.php?group_id=2406
-
- or grab the latest git repository from:
-
- https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
-
- - Create a new filesystem using the ext4 filesystem type:
-
- # mke2fs -t ext4 /dev/hda1
-
- Or to configure an existing ext3 filesystem to support extents:
-
- # tune2fs -O extents /dev/hda1
-
- If the filesystem was created with 128 byte inodes, it can be
- converted to use 256 byte for greater efficiency via:
-
- # tune2fs -I 256 /dev/hda1
-
- - Mounting:
-
- # mount -t ext4 /dev/hda1 /wherever
-
- - When comparing performance with other filesystems, it's always
- important to try multiple workloads; very often a subtle change in a
- workload parameter can completely change the ranking of which
- filesystems do well compared to others. When comparing versus ext3,
- note that ext4 enables write barriers by default, while ext3 does
- not enable write barriers by default. So it is useful to use
- explicitly specify whether barriers are enabled or not when via the
- '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
- for a fair comparison. When tuning ext3 for best benchmark numbers,
- it is often worthwhile to try changing the data journaling mode; '-o
- data=writeback' can be faster for some workloads. (Note however that
- running mounted with data=writeback can potentially leave stale data
- exposed in recently written files in case of an unclean shutdown,
- which could be a security exposure in some situations.) Configuring
- the filesystem with a large journal can also be helpful for
- metadata-intensive workloads.
-
-Features
-========
-
-Currently Available
--------------------
-
-* ability to use filesystems > 16TB (e2fsprogs support not available yet)
-* extent format reduces metadata overhead (RAM, IO for access, transactions)
-* extent format more robust in face of on-disk corruption due to magics,
-* internal redundancy in tree
-* improved file allocation (multi-block alloc)
-* lift 32000 subdirectory limit imposed by i_links_count[1]
-* nsec timestamps for mtime, atime, ctime, create time
-* inode version field on disk (NFSv4, Lustre)
-* reduced e2fsck time via uninit_bg feature
-* journal checksumming for robustness, performance
-* persistent file preallocation (e.g for streaming media, databases)
-* ability to pack bitmaps and inode tables into larger virtual groups via the
- flex_bg feature
-* large file support
-* inode allocation using large virtual block groups via flex_bg
-* delayed allocation
-* large block (up to pagesize) support
-* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
- the ordering)
-
-[1] Filesystems with a block size of 1k may see a limit imposed by the
-directory hash tree having a maximum depth of two.
-
-Options
-=======
-
-When mounting an ext4 filesystem, the following option are accepted:
-(*) == default
-
-======================= =======================================================
-Mount Option Description
-======================= =======================================================
-ro Mount filesystem read only. Note that ext4 will
- replay the journal (and thus write to the
- partition) even when mounted "read only". The
- mount options "ro,noload" can be used to prevent
- writes to the filesystem.
-
-journal_checksum Enable checksumming of the journal transactions.
- This will allow the recovery code in e2fsck and the
- kernel to detect corruption in the kernel. It is a
- compatible change and will be ignored by older kernels.
-
-journal_async_commit Commit block can be written to disk without waiting
- for descriptor blocks. If enabled older kernels cannot
- mount the device. This will enable 'journal_checksum'
- internally.
-
-journal_path=path
-journal_dev=devnum When the external journal device's major/minor numbers
- have changed, these options allow the user to specify
- the new journal location. The journal device is
- identified through either its new major/minor numbers
- encoded in devnum, or via a path to the device.
-
-norecovery Don't load the journal on mounting. Note that
-noload if the filesystem was not unmounted cleanly,
- skipping the journal replay will lead to the
- filesystem containing inconsistencies that can
- lead to any number of problems.
-
-data=journal All data are committed into the journal prior to being
- written into the main file system. Enabling
- this mode will disable delayed allocation and
- O_DIRECT support.
-
-data=ordered (*) All data are forced directly out to the main file
- system prior to its metadata being committed to the
- journal.
-
-data=writeback Data ordering is not preserved, data may be written
- into the main file system after its metadata has been
- committed to the journal.
-
-commit=nrsec (*) Ext4 can be told to sync all its data and metadata
- every 'nrsec' seconds. The default value is 5 seconds.
- This means that if you lose your power, you will lose
- as much as the latest 5 seconds of work (your
- filesystem will not be damaged though, thanks to the
- journaling). This default value (or any low value)
- will hurt performance, but it's good for data-safety.
- Setting it to 0 will have the same effect as leaving
- it at the default (5 seconds).
- Setting it to very large values will improve
- performance.
-
-barrier=<0|1(*)> This enables/disables the use of write barriers in
-barrier(*) the jbd code. barrier=0 disables, barrier=1 enables.
-nobarrier This also requires an IO stack which can support
- barriers, and if jbd gets an error on a barrier
- write, it will disable again with a warning.
- Write barriers enforce proper on-disk ordering
- of journal commits, making volatile disk write caches
- safe to use, at some performance penalty. If
- your disks are battery-backed in one way or another,
- disabling barriers may safely improve performance.
- The mount options "barrier" and "nobarrier" can
- also be used to enable or disable barriers, for
- consistency with other ext4 mount options.
-
-inode_readahead_blks=n This tuning parameter controls the maximum
- number of inode table blocks that ext4's inode
- table readahead algorithm will pre-read into
- the buffer cache. The default value is 32 blocks.
-
-nouser_xattr Disables Extended User Attributes. See the
- attr(5) manual page for more information about
- extended attributes.
-
-noacl This option disables POSIX Access Control List
- support. If ACL support is enabled in the kernel
- configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
- enabled by default on mount. See the acl(5) manual
- page for more information about acl.
-
-bsddf (*) Make 'df' act like BSD.
-minixdf Make 'df' act like Minix.
-
-debug Extra debugging information is sent to syslog.
-
-abort Simulate the effects of calling ext4_abort() for
- debugging purposes. This is normally used while
- remounting a filesystem which is already mounted.
-
-errors=remount-ro Remount the filesystem read-only on an error.
-errors=continue Keep going on a filesystem error.
-errors=panic Panic and halt the machine if an error occurs.
- (These mount options override the errors behavior
- specified in the superblock, which can be configured
- using tune2fs)
-
-data_err=ignore(*) Just print an error message if an error occurs
- in a file data buffer in ordered mode.
-data_err=abort Abort the journal if an error occurs in a file
- data buffer in ordered mode.
-
-grpid New objects have the group ID of their parent.
-bsdgroups
-
-nogrpid (*) New objects have the group ID of their creator.
-sysvgroups
-
-resgid=n The group ID which may use the reserved blocks.
-
-resuid=n The user ID which may use the reserved blocks.
-
-sb=n Use alternate superblock at this location.
-
-quota These options are ignored by the filesystem. They
-noquota are used only by quota tools to recognize volumes
-grpquota where quota should be turned on. See documentation
-usrquota in the quota-tools package for more details
- (http://sourceforge.net/projects/linuxquota).
-
-jqfmt=<quota type> These options tell filesystem details about quota
-usrjquota=<file> so that quota information can be properly updated
-grpjquota=<file> during journal replay. They replace the above
- quota options. See documentation in the quota-tools
- package for more details
- (http://sourceforge.net/projects/linuxquota).
-
-stripe=n Number of filesystem blocks that mballoc will try
- to use for allocation size and alignment. For RAID5/6
- systems this should be the number of data
- disks * RAID chunk size in file system blocks.
-
-delalloc (*) Defer block allocation until just before ext4
- writes out the block(s) in question. This
- allows ext4 to better allocation decisions
- more efficiently.
-nodelalloc Disable delayed allocation. Blocks are allocated
- when the data is copied from userspace to the
- page cache, either via the write(2) system call
- or when an mmap'ed page which was previously
- unallocated is written for the first time.
-
-max_batch_time=usec Maximum amount of time ext4 should wait for
- additional filesystem operations to be batch
- together with a synchronous write operation.
- Since a synchronous write operation is going to
- force a commit and then a wait for the I/O
- complete, it doesn't cost much, and can be a
- huge throughput win, we wait for a small amount
- of time to see if any other transactions can
- piggyback on the synchronous write. The
- algorithm used is designed to automatically tune
- for the speed of the disk, by measuring the
- amount of time (on average) that it takes to
- finish committing a transaction. Call this time
- the "commit time". If the time that the
- transaction has been running is less than the
- commit time, ext4 will try sleeping for the
- commit time to see if other operations will join
- the transaction. The commit time is capped by
- the max_batch_time, which defaults to 15000us
- (15ms). This optimization can be turned off
- entirely by setting max_batch_time to 0.
-
-min_batch_time=usec This parameter sets the commit time (as
- described above) to be at least min_batch_time.
- It defaults to zero microseconds. Increasing
- this parameter may improve the throughput of
- multi-threaded, synchronous workloads on very
- fast disks, at the cost of increasing latency.
-
-journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
- highest priority) which should be used for I/O
- operations submitted by kjournald2 during a
- commit operation. This defaults to 3, which is
- a slightly higher priority than the default I/O
- priority.
-
-auto_da_alloc(*) Many broken applications don't use fsync() when
-noauto_da_alloc replacing existing files via patterns such as
- fd = open("foo.new")/write(fd,..)/close(fd)/
- rename("foo.new", "foo"), or worse yet,
- fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
- If auto_da_alloc is enabled, ext4 will detect
- the replace-via-rename and replace-via-truncate
- patterns and force that any delayed allocation
- blocks are allocated such that at the next
- journal commit, in the default data=ordered
- mode, the data blocks of the new file are forced
- to disk before the rename() operation is
- committed. This provides roughly the same level
- of guarantees as ext3, and avoids the
- "zero-length" problem that can happen when a
- system crashes before the delayed allocation
- blocks are forced to disk.
-
-noinit_itable Do not initialize any uninitialized inode table
- blocks in the background. This feature may be
- used by installation CD's so that the install
- process can complete as quickly as possible; the
- inode table initialization process would then be
- deferred until the next time the file system
- is unmounted.
-
-init_itable=n The lazy itable init code will wait n times the
- number of milliseconds it took to zero out the
- previous block group's inode table. This
- minimizes the impact on the system performance
- while file system's inode table is being initialized.
-
-discard Controls whether ext4 should issue discard/TRIM
-nodiscard(*) commands to the underlying block device when
- blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisioned LUNs, but it is off
- by default until sufficient testing has been done.
-
-nouid32 Disables 32-bit UIDs and GIDs. This is for
- interoperability with older kernels which only
- store and expect 16-bit values.
-
-block_validity(*) These options enable or disable the in-kernel
-noblock_validity facility for tracking filesystem metadata blocks
- within internal data structures. This allows multi-
- block allocator and other routines to notice
- bugs or corrupted allocation bitmaps which cause
- blocks to be allocated which overlap with
- filesystem metadata blocks.
-
-dioread_lock Controls whether or not ext4 should use the DIO read
-dioread_nolock locking. If the dioread_nolock option is specified
- ext4 will allocate uninitialized extent before buffer
- write and convert the extent to initialized after IO
- completes. This approach allows ext4 code to avoid
- using inode mutex, which improves scalability on high
- speed storages. However this does not work with
- data journaling and dioread_nolock option will be
- ignored with kernel warning. Note that dioread_nolock
- code path is only used for extent-based files.
- Because of the restrictions this options comprises
- it is off by default (e.g. dioread_lock).
-
-max_dir_size_kb=n This limits the size of directories so that any
- attempt to expand them beyond the specified
- limit in kilobytes will cause an ENOSPC error.
- This is useful in memory constrained
- environments, where a very large directory can
- cause severe performance problems or even
- provoke the Out Of Memory killer. (For example,
- if there is only 512mb memory available, a 176mb
- directory may seriously cramp the system's style.)
-
-i_version Enable 64-bit inode version support. This option is
- off by default.
-
-dax Use direct access (no page cache). See
- Documentation/filesystems/dax.txt. Note that
- this option is incompatible with data=journal.
-======================= =======================================================
-
-Data Mode
-=========
-There are 3 different data modes:
-
-* writeback mode
-
- In data=writeback mode, ext4 does not journal data at all. This mode provides
- a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
- mode - metadata journaling. A crash+recovery can cause incorrect data to
- appear in files which were written shortly before the crash. This mode will
- typically provide the best ext4 performance.
-
-* ordered mode
-
- In data=ordered mode, ext4 only officially journals metadata, but it logically
- groups metadata information related to data changes with the data blocks into
- a single unit called a transaction. When it's time to write the new metadata
- out to disk, the associated data blocks are written first. In general, this
- mode performs slightly slower than writeback but significantly faster than
- journal mode.
-
-* journal mode
-
- data=journal mode provides full data and metadata journaling. All new data is
- written to the journal first, and then to its final location. In the event of
- a crash, the journal can be replayed, bringing both data and metadata into a
- consistent state. This mode is the slowest except when data needs to be read
- from and written to disk at the same time where it outperforms all others
- modes. Enabling this mode will disable delayed allocation and O_DIRECT
- support.
-
-/proc entries
-=============
-
-Information about mounted ext4 file systems can be found in
-/proc/fs/ext4. Each mounted filesystem will have a directory in
-/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
-/proc/fs/ext4/dm-0). The files in each per-device directory are shown
-in table below.
-
-Files in /proc/fs/ext4/<devname>
-
-================ =======
- File Content
-================ =======
- mb_groups details of multiblock allocator buddy cache of free blocks
-================ =======
-
-/sys entries
-============
-
-Information about mounted ext4 file systems can be found in
-/sys/fs/ext4. Each mounted filesystem will have a directory in
-/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or
-/sys/fs/ext4/dm-0). The files in each per-device directory are shown
-in table below.
-
-Files in /sys/fs/ext4/<devname>:
-
-(see also Documentation/ABI/testing/sysfs-fs-ext4)
-
-============================= =================================================
-File Content
-============================= =================================================
- delayed_allocation_blocks This file is read-only and shows the number of
- blocks that are dirty in the page cache, but
- which do not have their location in the
- filesystem allocated yet.
-
-inode_goal Tuning parameter which (if non-zero) controls
- the goal inode used by the inode allocator in
- preference to all other allocation heuristics.
- This is intended for debugging use only, and
- should be 0 on production systems.
-
-inode_readahead_blks Tuning parameter which controls the maximum
- number of inode table blocks that ext4's inode
- table readahead algorithm will pre-read into
- the buffer cache
-
-lifetime_write_kbytes This file is read-only and shows the number of
- kilobytes of data that have been written to this
- filesystem since it was created.
-
- max_writeback_mb_bump The maximum number of megabytes the writeback
- code will try to write out before move on to
- another inode.
-
- mb_group_prealloc The multiblock allocator will round up allocation
- requests to a multiple of this tuning parameter if
- the stripe size is not set in the ext4 superblock
-
- mb_max_to_scan The maximum number of extents the multiblock
- allocator will search to find the best extent
-
- mb_min_to_scan The minimum number of extents the multiblock
- allocator will search to find the best extent
-
- mb_order2_req Tuning parameter which controls the minimum size
- for requests (as a power of 2) where the buddy
- cache is used
-
- mb_stats Controls whether the multiblock allocator should
- collect statistics, which are shown during the
- unmount. 1 means to collect statistics, 0 means
- not to collect statistics
-
- mb_stream_req Files which have fewer blocks than this tunable
- parameter will have their blocks allocated out
- of a block group specific preallocation pool, so
- that small files are packed closely together.
- Each large file will have its blocks allocated
- out of its own unique preallocation pool.
-
- session_write_kbytes This file is read-only and shows the number of
- kilobytes of data that have been written to this
- filesystem since it was mounted.
-
- reserved_clusters This is RW file and contains number of reserved
- clusters in the file system which will be used
- in the specific situations to avoid costly
- zeroout, unexpected ENOSPC, or possible data
- loss. The default is 2% or 4096 clusters,
- whichever is smaller and this can be changed
- however it can never exceed number of clusters
- in the file system. If there is not enough space
- for the reserved space when mounting the file
- mount will _not_ fail.
-============================= =================================================
-
-Ioctls
-======
-
-There is some Ext4 specific functionality which can be accessed by applications
-through the system call interfaces. The list of all Ext4 specific ioctls are
-shown in the table below.
-
-Table of Ext4 specific ioctls
-
-============================= =================================================
-Ioctl Description
-============================= =================================================
- EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
- The ioctl argument is an integer bitfield, with
- bit values described in ext4.h. This ioctl is an
- alias for FS_IOC_GETFLAGS.
-
- EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
- The ioctl argument is an integer bitfield, with
- bit values described in ext4.h. This ioctl is an
- alias for FS_IOC_SETFLAGS.
-
- EXT4_IOC_GETVERSION
- EXT4_IOC_GETVERSION_OLD
- Get the inode i_generation number stored for
- each inode. The i_generation number is normally
- changed only when new inode is created and it is
- particularly useful for network filesystems. The
- '_OLD' version of this ioctl is an alias for
- FS_IOC_GETVERSION.
-
- EXT4_IOC_SETVERSION
- EXT4_IOC_SETVERSION_OLD
- Set the inode i_generation number stored for
- each inode. The '_OLD' version of this ioctl
- is an alias for FS_IOC_SETVERSION.
-
- EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
- mount option. It allows to resize filesystem
- to the end of the last existing block group,
- further resize has to be done with resize2fs,
- either online, or offline. The argument points
- to the unsigned logn number representing the
- filesystem new block count.
-
- EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
- this ioctl is pointing to) to the donor_fd (the
- one specified in move_extent structure passed
- as an argument to this ioctl). Then, exchange
- inode metadata between orig_fd and donor_fd.
- This is especially useful for online
- defragmentation, because the allocator has the
- opportunity to allocate moved blocks better,
- ideally into one contiguous extent.
-
- EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
- new group descriptor block. The new group
- descriptor is described by ext4_new_group_input
- structure, which is passed as an argument to
- this ioctl. This is especially useful in
- conjunction with EXT4_IOC_GROUP_EXTEND,
- which allows online resize of the filesystem
- to the end of the last existing block group.
- Those two ioctls combined is used in userspace
- online resize tool (e.g. resize2fs).
-
- EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
- It converts (migrates) ext3 indirect block mapped
- inode to ext4 extent mapped inode by walking
- through indirect block mapping of the original
- inode and converting contiguous block ranges
- into ext4 extents of the temporary inode. Then,
- inodes are swapped. This ioctl might help, when
- migrating from ext3 to ext4 filesystem, however
- suggestion is to create fresh ext4 filesystem
- and copy data from the backup. Note, that
- filesystem has to support extents for this ioctl
- to work.
-
- EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be
- allocated to preserve application-expected ext3
- behaviour. Note that this will also start
- triggering a write of the data blocks, but this
- behaviour may change in the future as it is
- not necessary and has been done this way only
- for sake of simplicity.
-
- EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number
- of blocks of resized filesystem is passed in via
- 64 bit integer argument. The kernel allocates
- bitmaps and inode table, the userspace tool thus
- just passes the new number of blocks.
-
- EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
- (like i_blocks, i_size, i_flags, ...) from
- the specified inode with inode
- EXT4_BOOT_LOADER_INO (#5). This is typically
- used to store a boot loader in a secure part of
- the filesystem, where it can't be changed by a
- normal user by accident.
- The data blocks of the previous boot loader
- will be associated with the given inode.
-============================= =================================================
-
-References
-==========
-
-kernel source: <file:fs/ext4/>
- <file:fs/jbd2/>
-
-programs: http://e2fsprogs.sourceforge.net/
-
-useful links: http://fedoraproject.org/wiki/ext3-devel
- http://www.bullopensource.org/ext4/
- http://ext4.wiki.kernel.org/index.php/Main_Page
- http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/filesystems/ext4/ondisk/globals.rst b/Documentation/filesystems/ext4/globals.rst
index 368bf7662b96..368bf7662b96 100644
--- a/Documentation/filesystems/ext4/ondisk/globals.rst
+++ b/Documentation/filesystems/ext4/globals.rst
diff --git a/Documentation/filesystems/ext4/ondisk/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst
index 759827e5d2cf..0f783ed88592 100644
--- a/Documentation/filesystems/ext4/ondisk/group_descr.rst
+++ b/Documentation/filesystems/ext4/group_descr.rst
@@ -43,7 +43,7 @@ entire bitmap.
The block group descriptor is laid out in ``struct ext4_group_desc``.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -157,7 +157,7 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
Block group flags can be any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
diff --git a/Documentation/filesystems/ext4/ondisk/ifork.rst b/Documentation/filesystems/ext4/ifork.rst
index 5dbe3b2b121a..b9816d5a896b 100644
--- a/Documentation/filesystems/ext4/ondisk/ifork.rst
+++ b/Documentation/filesystems/ext4/ifork.rst
@@ -68,7 +68,7 @@ The extent tree header is recorded in ``struct ext4_extent_header``,
which is 12 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -104,7 +104,7 @@ Internal nodes of the extent tree, also known as index nodes, are
recorded as ``struct ext4_extent_idx``, and are 12 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -134,7 +134,7 @@ Leaf nodes of the extent tree are recorded as ``struct ext4_extent``,
and are also 12 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -174,7 +174,7 @@ including) the checksum itself.
``struct ext4_extent_tail`` is 4 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
index 71121605558c..3be3e54d480d 100644
--- a/Documentation/filesystems/ext4/index.rst
+++ b/Documentation/filesystems/ext4/index.rst
@@ -1,17 +1,14 @@
.. SPDX-License-Identifier: GPL-2.0
-===============
-ext4 Filesystem
-===============
-
-General usage and on-disk artifacts writen by ext4. More documentation may
-be ported from the wiki as time permits. This should be considered the
-canonical source of information as the details here have been reviewed by
-the ext4 community.
+===================================
+ext4 Data Structures and Algorithms
+===================================
.. toctree::
- :maxdepth: 5
+ :maxdepth: 6
:numbered:
- ext4
- ondisk/index
+ about.rst
+ overview.rst
+ globals.rst
+ dynamic.rst
diff --git a/Documentation/filesystems/ext4/ondisk/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst
index d1075178ce0b..d1075178ce0b 100644
--- a/Documentation/filesystems/ext4/ondisk/inlinedata.rst
+++ b/Documentation/filesystems/ext4/inlinedata.rst
diff --git a/Documentation/filesystems/ext4/ondisk/inodes.rst b/Documentation/filesystems/ext4/inodes.rst
index 655ce898f3f5..6bd35e506b6f 100644
--- a/Documentation/filesystems/ext4/ondisk/inodes.rst
+++ b/Documentation/filesystems/ext4/inodes.rst
@@ -29,8 +29,9 @@ and the inode structure itself.
The inode table entry is laid out in ``struct ext4_inode``.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
+ :class: longtable
* - Offset
- Size
@@ -176,7 +177,7 @@ The inode table entry is laid out in ``struct ext4_inode``.
The ``i_mode`` value is a combination of the following flags:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -227,7 +228,7 @@ The ``i_mode`` value is a combination of the following flags:
The ``i_flags`` field is a combination of these values:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -314,7 +315,7 @@ The ``osd1`` field has multiple meanings depending on the creator:
Linux:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -331,7 +332,7 @@ Linux:
Hurd:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -346,7 +347,7 @@ Hurd:
Masix:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -365,7 +366,7 @@ The ``osd2`` field has multiple meanings depending on the filesystem creator:
Linux:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -402,7 +403,7 @@ Linux:
Hurd:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -433,7 +434,7 @@ Hurd:
Masix:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
diff --git a/Documentation/filesystems/ext4/ondisk/journal.rst b/Documentation/filesystems/ext4/journal.rst
index e7031af86876..ea613ee701f5 100644
--- a/Documentation/filesystems/ext4/ondisk/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -48,7 +48,7 @@ Layout
Generally speaking, the journal has this format:
.. list-table::
- :widths: 1 1 78
+ :widths: 16 48 16
:header-rows: 1
* - Superblock
@@ -76,7 +76,7 @@ The journal superblock will be in the next full block after the
superblock.
.. list-table::
- :widths: 1 1 1 1 76
+ :widths: 12 12 12 32 12
:header-rows: 1
* - 1024 bytes of padding
@@ -98,7 +98,7 @@ Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -124,7 +124,7 @@ Every block in the journal starts with a common 12-byte header
The journal block type can be any one of:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -154,7 +154,7 @@ The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -264,7 +264,7 @@ which is 1024 bytes long:
The journal compat features are any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -278,7 +278,7 @@ The journal compat features are any combination of the following:
The journal incompat features are any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -306,7 +306,7 @@ Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -330,7 +330,7 @@ described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -355,7 +355,7 @@ defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -400,7 +400,7 @@ following. The size is 16 or 32 bytes.
The journal tag flags are any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -421,7 +421,7 @@ is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -471,7 +471,7 @@ JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -513,7 +513,7 @@ Revocation blocks are described in
length, but use a full block:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -543,7 +543,7 @@ JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -567,7 +567,7 @@ The commit block is described by ``struct commit_header``, which is 32
bytes long (but uses a full block):
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
diff --git a/Documentation/filesystems/ext4/ondisk/mmp.rst b/Documentation/filesystems/ext4/mmp.rst
index b7d7a3137f80..25660981d93c 100644
--- a/Documentation/filesystems/ext4/ondisk/mmp.rst
+++ b/Documentation/filesystems/ext4/mmp.rst
@@ -32,7 +32,7 @@ The checksum is calculated against the FS UUID and the MMP structure.
The MMP structure (``struct mmp_struct``) is as follows:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 12 20 40
:header-rows: 1
* - Offset
diff --git a/Documentation/filesystems/ext4/ondisk/index.rst b/Documentation/filesystems/ext4/ondisk/index.rst
deleted file mode 100644
index f7d082c3a435..000000000000
--- a/Documentation/filesystems/ext4/ondisk/index.rst
+++ /dev/null
@@ -1,9 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============================
-Data Structures and Algorithms
-==============================
-.. include:: about.rst
-.. include:: overview.rst
-.. include:: globals.rst
-.. include:: dynamic.rst
diff --git a/Documentation/filesystems/ext4/ondisk/overview.rst b/Documentation/filesystems/ext4/overview.rst
index cbab18baba12..cbab18baba12 100644
--- a/Documentation/filesystems/ext4/ondisk/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
diff --git a/Documentation/filesystems/ext4/ondisk/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
index a82f70c9baeb..9061aabba827 100644
--- a/Documentation/filesystems/ext4/ondisk/special_inodes.rst
+++ b/Documentation/filesystems/ext4/special_inodes.rst
@@ -6,7 +6,7 @@ Special inodes
ext4 reserves some inode for special features, as follows:
.. list-table::
- :widths: 1 79
+ :widths: 6 70
:header-rows: 1
* - inode Number
diff --git a/Documentation/filesystems/ext4/ondisk/super.rst b/Documentation/filesystems/ext4/super.rst
index 5f81dd87e0b9..04ff079a2acf 100644
--- a/Documentation/filesystems/ext4/ondisk/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -19,7 +19,7 @@ The ext4 superblock is laid out as follows in
``struct ext4_super_block``:
.. list-table::
- :widths: 1 1 1 77
+ :widths: 8 8 24 40
:header-rows: 1
* - Offset
@@ -483,7 +483,7 @@ The ext4 superblock is laid out as follows in
The superblock state is some combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -500,7 +500,7 @@ The superblock state is some combination of the following:
The superblock error policy is one of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -517,7 +517,7 @@ The superblock error policy is one of the following:
The filesystem creator is one of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -538,7 +538,7 @@ The filesystem creator is one of the following:
The superblock revision is one of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -556,7 +556,7 @@ The superblock compatible features field is a combination of any of the
following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -595,7 +595,7 @@ The superblock incompatible features field is a combination of any of the
following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -647,7 +647,7 @@ The superblock read-only compatible features field is a combination of any of
the following:
.. list-table::
- :widths: 1 79
+ :widths: 16 64
:header-rows: 1
* - Value
@@ -702,7 +702,7 @@ the following:
The ``s_def_hash_version`` field is one of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -725,7 +725,7 @@ The ``s_def_hash_version`` field is one of the following:
The ``s_default_mount_opts`` field is any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -767,7 +767,7 @@ The ``s_default_mount_opts`` field is any combination of the following:
The ``s_flags`` field is any combination of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
@@ -784,7 +784,7 @@ The ``s_flags`` field is any combination of the following:
The ``s_encrypt_algos`` list can contain any of the following:
.. list-table::
- :widths: 1 79
+ :widths: 8 72
:header-rows: 1
* - Value
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
index e5edd29687b5..e46c2147ddf8 100644
--- a/Documentation/filesystems/f2fs.txt
+++ b/Documentation/filesystems/f2fs.txt
@@ -172,9 +172,10 @@ fault_type=%d Support configuring fault injection type, should be
FAULT_DIR_DEPTH 0x000000100
FAULT_EVICT_INODE 0x000000200
FAULT_TRUNCATE 0x000000400
- FAULT_IO 0x000000800
+ FAULT_READ_IO 0x000000800
FAULT_CHECKPOINT 0x000001000
FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@@ -211,6 +212,11 @@ fsync_mode=%s Control the policy of fsync. Currently supports "posix",
non-atomic files likewise "nobarrier" mount option.
test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt
context. The fake fscrypt context is used by xfstests.
+checkpoint=%s Set to "disable" to turn off checkpointing. Set to "enable"
+ to reenable checkpointing. Is enabled by default. While
+ disabled, any unmounting or unexpected shutdowns will cause
+ the filesystem contents to appear as they did when the
+ filesystem was mounted with that option.
================================================================================
DEBUGFS ENTRIES
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index cfbc18f0d9c9..3a7b60521b94 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -132,47 +132,28 @@ designed for this purpose be used, such as scrypt, PBKDF2, or Argon2.
Per-file keys
-------------
-Master keys are not used to encrypt file contents or names directly.
-Instead, a unique key is derived for each encrypted file, including
-each regular file, directory, and symbolic link. This has several
-advantages:
-
-- In cryptosystems, the same key material should never be used for
- different purposes. Using the master key as both an XTS key for
- contents encryption and as a CTS-CBC key for filenames encryption
- would violate this rule.
-- Per-file keys simplify the choice of IVs (Initialization Vectors)
- for contents encryption. Without per-file keys, to ensure IV
- uniqueness both the inode and logical block number would need to be
- encoded in the IVs. This would make it impossible to renumber
- inodes, which e.g. ``resize2fs`` can do when resizing an ext4
- filesystem. With per-file keys, it is sufficient to encode just the
- logical block number in the IVs.
-- Per-file keys strengthen the encryption of filenames, where IVs are
- reused out of necessity. With a unique key per directory, IV reuse
- is limited to within a single directory.
-- Per-file keys allow individual files to be securely erased simply by
- securely erasing their keys. (Not yet implemented.)
-
-A KDF (Key Derivation Function) is used to derive per-file keys from
-the master key. This is done instead of wrapping a randomly-generated
-key for each file because it reduces the size of the encryption xattr,
-which for some filesystems makes the xattr more likely to fit in-line
-in the filesystem's inode table. With a KDF, only a 16-byte nonce is
-required --- long enough to make key reuse extremely unlikely. A
-wrapped key, on the other hand, would need to be up to 64 bytes ---
-the length of an AES-256-XTS key. Furthermore, currently there is no
-requirement to support unlocking a file with multiple alternative
-master keys or to support rotating master keys. Instead, the master
-keys may be wrapped in userspace, e.g. as done by the `fscrypt
-<https://github.com/google/fscrypt>`_ tool.
-
-The current KDF encrypts the master key using the 16-byte nonce as an
-AES-128-ECB key. The output is used as the derived key. If the
-output is longer than needed, then it is truncated to the needed
-length. Truncation is the norm for directories and symlinks, since
-those use the CTS-CBC encryption mode which requires a key half as
-long as that required by the XTS encryption mode.
+Since each master key can protect many files, it is necessary to
+"tweak" the encryption of each file so that the same plaintext in two
+files doesn't map to the same ciphertext, or vice versa. In most
+cases, fscrypt does this by deriving per-file keys. When a new
+encrypted inode (regular file, directory, or symlink) is created,
+fscrypt randomly generates a 16-byte nonce and stores it in the
+inode's encryption xattr. Then, it uses a KDF (Key Derivation
+Function) to derive the file's key from the master key and nonce.
+
+The Adiantum encryption mode (see `Encryption modes and usage`_) is
+special, since it accepts longer IVs and is suitable for both contents
+and filenames encryption. For it, a "direct key" option is offered
+where the file's nonce is included in the IVs and the master key is
+used for encryption directly. This improves performance; however,
+users must not use the same master key for any other encryption mode.
+
+Below, the KDF and design considerations are described in more detail.
+
+The current KDF works by encrypting the master key with AES-128-ECB,
+using the file's nonce as the AES key. The output is used as the
+derived key. If the output is longer than needed, then it is
+truncated to the needed length.
Note: this KDF meets the primary security requirement, which is to
produce unique derived keys that preserve the entropy of the master
@@ -181,6 +162,20 @@ However, it is nonstandard and has some problems such as being
reversible, so it is generally considered to be a mistake! It may be
replaced with HKDF or another more standard KDF in the future.
+Key derivation was chosen over key wrapping because wrapped keys would
+require larger xattrs which would be less likely to fit in-line in the
+filesystem's inode table, and there didn't appear to be any
+significant advantages to key wrapping. In particular, currently
+there is no requirement to support unlocking a file with multiple
+alternative master keys or to support rotating master keys. Instead,
+the master keys may be wrapped in userspace, e.g. as is done by the
+`fscrypt <https://github.com/google/fscrypt>`_ tool.
+
+Including the inode number in the IVs was considered. However, it was
+rejected as it would have prevented ext4 filesystems from being
+resized, and by itself still wouldn't have been sufficient to prevent
+the same key from being directly reused for both XTS and CTS-CBC.
+
Encryption modes and usage
==========================
@@ -191,54 +186,80 @@ Currently, the following pairs of encryption modes are supported:
- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
- AES-128-CBC for contents and AES-128-CTS-CBC for filenames
+- Adiantum for both contents and filenames
+
+If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair.
-It is strongly recommended to use AES-256-XTS for contents encryption.
AES-128-CBC was added only for low-powered embedded devices with
crypto accelerators such as CAAM or CESA that do not support XTS.
+Adiantum is a (primarily) stream cipher-based mode that is fast even
+on CPUs without dedicated crypto instructions. It's also a true
+wide-block mode, unlike XTS. It can also eliminate the need to derive
+per-file keys. However, it depends on the security of two primitives,
+XChaCha12 and AES-256, rather than just one. See the paper
+"Adiantum: length-preserving encryption for entry-level processors"
+(https://eprint.iacr.org/2018/720.pdf) for more details. To use
+Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
+implementations of ChaCha and NHPoly1305 should be enabled, e.g.
+CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
+
New encryption modes can be added relatively easily, without changes
to individual filesystems. However, authenticated encryption (AE)
modes are not currently supported because of the difficulty of dealing
with ciphertext expansion.
+Contents encryption
+-------------------
+
For file contents, each filesystem block is encrypted independently.
Currently, only the case where the filesystem block size is equal to
-the system's page size (usually 4096 bytes) is supported. With the
-XTS mode of operation (recommended), the logical block number within
-the file is used as the IV. With the CBC mode of operation (not
-recommended), ESSIV is used; specifically, the IV for CBC is the
-logical block number encrypted with AES-256, where the AES-256 key is
-the SHA-256 hash of the inode's data encryption key.
-
-For filenames, the full filename is encrypted at once. Because of the
-requirements to retain support for efficient directory lookups and
-filenames of up to 255 bytes, a constant initialization vector (IV) is
-used. However, each encrypted directory uses a unique key, which
-limits IV reuse to within a single directory. Note that IV reuse in
-the context of CTS-CBC encryption means that when the original
-filenames share a common prefix at least as long as the cipher block
-size (16 bytes for AES), the corresponding encrypted filenames will
-also share a common prefix. This is undesirable; it may be fixed in
-the future by switching to an encryption mode that is a strong
-pseudorandom permutation on arbitrary-length messages, e.g. the HEH
-(Hash-Encrypt-Hash) mode.
-
-Since filenames are encrypted with the CTS-CBC mode of operation, the
-plaintext and ciphertext filenames need not be multiples of the AES
-block size, i.e. 16 bytes. However, the minimum size that can be
-encrypted is 16 bytes, so shorter filenames are NUL-padded to 16 bytes
-before being encrypted. In addition, to reduce leakage of filename
-lengths via their ciphertexts, all filenames are NUL-padded to the
-next 4, 8, 16, or 32-byte boundary (configurable). 32 is recommended
-since this provides the best confidentiality, at the cost of making
-directory entries consume slightly more space. Note that since NUL
-(``\0``) is not otherwise a valid character in filenames, the padding
-will never produce duplicate plaintexts.
+the system's page size (usually 4096 bytes) is supported.
+
+Each block's IV is set to the logical block number within the file as
+a little endian number, except that:
+
+- With CBC mode encryption, ESSIV is also used. Specifically, each IV
+ is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
+ of the file's data encryption key.
+
+- In the "direct key" configuration (FS_POLICY_FLAG_DIRECT_KEY set in
+ the fscrypt_policy), the file's nonce is also appended to the IV.
+ Currently this is only allowed with the Adiantum encryption mode.
+
+Filenames encryption
+--------------------
+
+For filenames, each full filename is encrypted at once. Because of
+the requirements to retain support for efficient directory lookups and
+filenames of up to 255 bytes, the same IV is used for every filename
+in a directory.
+
+However, each encrypted directory still uses a unique key; or
+alternatively (for the "direct key" configuration) has the file's
+nonce included in the IVs. Thus, IV reuse is limited to within a
+single directory.
+
+With CTS-CBC, the IV reuse means that when the plaintext filenames
+share a common prefix at least as long as the cipher block size (16
+bytes for AES), the corresponding encrypted filenames will also share
+a common prefix. This is undesirable. Adiantum does not have this
+weakness, as it is a wide-block encryption mode.
+
+All supported filenames encryption modes accept any plaintext length
+>= 16 bytes; cipher block alignment is not required. However,
+filenames shorter than 16 bytes are NUL-padded to 16 bytes before
+being encrypted. In addition, to reduce leakage of filename lengths
+via their ciphertexts, all filenames are NUL-padded to the next 4, 8,
+16, or 32-byte boundary (configurable). 32 is recommended since this
+provides the best confidentiality, at the cost of making directory
+entries consume slightly more space. Note that since NUL (``\0``) is
+not otherwise a valid character in filenames, the padding will never
+produce duplicate plaintexts.
Symbolic link targets are considered a type of filename and are
-encrypted in the same way as filenames in directory entries. Each
-symlink also uses a unique key; hence, the hardcoded IV is not a
-problem for symlinks.
+encrypted in the same way as filenames in directory entries, except
+that IV reuse is not a problem as each symlink has its own inode.
User API
========
@@ -272,9 +293,13 @@ This structure must be initialized as follows:
and FS_ENCRYPTION_MODE_AES_256_CTS (4) for
``filenames_encryption_mode``.
-- ``flags`` must be set to a value from ``<linux/fs.h>`` which
+- ``flags`` must contain a value from ``<linux/fs.h>`` which
identifies the amount of NUL-padding to use when encrypting
filenames. If unsure, use FS_POLICY_FLAGS_PAD_32 (0x3).
+ In addition, if the chosen encryption modes are both
+ FS_ENCRYPTION_MODE_ADIANTUM, this can contain
+ FS_POLICY_FLAG_DIRECT_KEY to specify that the master key should be
+ used directly, without key derivation.
- ``master_key_descriptor`` specifies how to find the master key in
the keyring; see `Adding keys`_. It is up to userspace to choose a
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 46d1b1be3a51..605befab300b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -359,3 +359,24 @@ encryption of files and directories.
:maxdepth: 2
fscrypt
+
+Pathname lookup
+===============
+
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
+
+.. toctree::
+ :maxdepth: 2
+
+ path-lookup.rst
diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX
deleted file mode 100644
index 53f3b596ac0d..000000000000
--- a/Documentation/filesystems/nfs/00-INDEX
+++ /dev/null
@@ -1,26 +0,0 @@
-00-INDEX
- - this file (nfs-related documentation).
-Exporting
- - explanation of how to make filesystems exportable.
-fault_injection.txt
- - information for using fault injection on the server
-knfsd-stats.txt
- - statistics which the NFS server makes available to user space.
-nfs.txt
- - nfs client, and DNS resolution for fs_locations.
-nfs41-server.txt
- - info on the Linux server implementation of NFSv4 minor version 1.
-nfs-rdma.txt
- - how to install and setup the Linux NFS/RDMA client and server software
-nfsd-admin-interfaces.txt
- - Administrative interfaces for nfsd.
-nfsroot.txt
- - short guide on setting up a diskless box with NFS root filesystem.
-pnfs.txt
- - short explanation of some of the internals of the pnfs client code
-rpc-cache.txt
- - introduction to the caching mechanisms in the sunrpc layer.
-idmapper.txt
- - information for configuring request-keys to be used by idmapper
-rpc-server-gss.txt
- - Information on GSS authentication support in the NFS Server
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt
index ebcaaee21616..c4dac829db0f 100644
--- a/Documentation/filesystems/nfs/rpc-cache.txt
+++ b/Documentation/filesystems/nfs/rpc-cache.txt
@@ -84,7 +84,7 @@ Creating a Cache
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
- cache with sunrpc_cache_lookup, and update the item
+ cache with sunrpc_cache_lookup_rcu, and update the item
with sunrpc_cache_update.
@@ -95,7 +95,7 @@ Creating a Cache
Using a cache
-------------
-To find a value in a cache, call sunrpc_cache_lookup passing a pointer
+To find a value in a cache, call sunrpc_cache_lookup_rcu passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry. If no
entry is found, a new entry will be create, added to the cache, and
@@ -116,7 +116,7 @@ item does become valid, the deferred copy of the request will be
revisited (->revisit). It is expected that this method will
reschedule the request for processing.
-The value returned by sunrpc_cache_lookup can also be passed to
+The value returned by sunrpc_cache_lookup_rcu can also be passed to
sunrpc_cache_update to set the content for the item. A second item is
passed which should hold the content. If the item found by _lookup
has valid data, then it is discarded and a new item is created. This
diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.rst
index e2edd45c4bc0..9d6b68853f5b 100644
--- a/Documentation/filesystems/path-lookup.md
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,20 +1,6 @@
-<head>
-<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
-</head>
-Pathname lookup in Linux.
-=========================
-
-This write-up is based on three articles published at lwn.net:
-
-- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
-- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
-- <https://lwn.net/Articles/650786/> A walk among the symlinks
-
-Written by Neil Brown with help from Al Viro and Jon Corbet.
-
-Introduction
-------------
+Introduction to pathname lookup
+===============================
The most obvious aspect of pathname lookup, which very little
exploration is needed to discover, is that it is complex. There are
@@ -32,58 +18,58 @@ distinctions we need to clarify first.
There are two sorts of ...
--------------------------
-[`openat()`]: http://man7.org/linux/man-pages/man2/openat.2.html
+.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
Pathnames (sometimes "file names"), used to identify objects in the
filesystem, will be familiar to most readers. They contain two sorts
-of elements: "slashes" that are sequences of one or more "`/`"
+of elements: "slashes" that are sequences of one or more "``/``"
characters, and "components" that are sequences of one or more
-non-"`/`" characters. These form two kinds of paths. Those that
+non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
from some other location specified by a file descriptor given to a
-"xxx`at`" system call such as "[`openat()`]".
+"``XXXat``" system call such as `openat() <openat_>`_.
-[`execveat()`]: http://man7.org/linux/man-pages/man2/execveat.2.html
+.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx`at`" system calls
-in Linux permit it when the `AT_EMPTY_PATH` flag is given. For
+generally forbidden in POSIX, but some of those "xxx``at``" system calls
+in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
-can execute it by calling [`execveat()`] passing the file descriptor,
-an empty path, and the `AT_EMPTY_PATH` flag.
+can execute it by calling `execveat() <execveat_>`_ passing
+the file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.
These paths can be divided into two sections: the final component and
everything else. The "everything else" is the easy bit. In all cases
it must identify a directory that already exists, otherwise an error
-such as `ENOENT` or `ENOTDIR` will be reported.
+such as ``ENOENT`` or ``ENOTDIR`` will be reported.
The final component is not so simple. Not only do different system
calls interpret it quite differently (e.g. some create it, some do
not), but it might not even exist: neither the empty pathname nor the
pathname that is just slashes have a final component. If it does
-exist, it could be "`.`" or "`..`" which are handled quite differently
+exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
-[POSIX]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
-If a pathname ends with a slash, such as "`/tmp/foo/`" it might be
+If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
-particular, `mkdir()` and `rmdir()` each create or remove a directory named
+particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "`/`". According to [POSIX]
+ending in "``/``". According to POSIX_
-> A pathname that contains at least one non- &lt;slash> character and
-> that ends with one or more trailing &lt;slash> characters shall not
-> be resolved successfully unless the last pathname component before
-> the trailing <slash> characters names an existing directory or a
-> directory entry that is to be created for a directory immediately
-> after the pathname is resolved.
+ A pathname that contains at least one non- &lt;slash> character and
+ that ends with one or more trailing &lt;slash> characters shall not
+ be resolved successfully unless the last pathname component before
+ the trailing <slash> characters names an existing directory or a
+ directory entry that is to be created for a directory immediately
+ after the pathname is resolved.
-The Linux pathname walking code (mostly in `fs/namei.c`) deals with
+The Linux pathname walking code (mostly in ``fs/namei.c``) deals with
all of these issues: breaking the path into components, handling the
"everything else" quite separately from the final component, and
checking that the trailing slash is not used where it isn't
@@ -100,15 +86,15 @@ of the possible races are seen most clearly in the context of the
"dcache" and an understanding of that is central to understanding
pathname lookup.
-More than just a cache.
------------------------
+More than just a cache
+----------------------
The "dcache" caches information about names in each filesystem to
make them quickly available for lookup. Each entry (known as a
"dentry") contains three significant fields: a component name, a
pointer to a parent dentry, and a pointer to the "inode" which
contains further information about the object in that parent with
-the given name. The inode pointer can be `NULL` indicating that the
+the given name. The inode pointer can be ``NULL`` indicating that the
name doesn't exist in the parent. While there can be linkage in the
dentry of a directory to the dentries of the children, that linkage is
not used for pathname lookup, and so will not be considered here.
@@ -135,7 +121,7 @@ whether remote filesystems like NFS and 9P, or cluster filesystems
like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
cached information, and must provide their own protection against
awkward races. The VFS can detect these filesystems by the
-`DCACHE_OP_REVALIDATE` flag being set in the dentry.
+``DCACHE_OP_REVALIDATE`` flag being set in the dentry.
REF-walk: simple concurrency management with refcounts and spinlocks
--------------------------------------------------------------------
@@ -144,22 +130,23 @@ With all of those divisions carefully classified, we can now start
looking at the actual process of walking along a path. In particular
we will start with the handling of the "everything else" part of a
pathname, and focus on the "REF-walk" approach to concurrency
-management. This code is found in the `link_path_walk()` function, if
-you ignore all the places that only run when "`LOOKUP_RCU`"
+management. This code is found in the ``link_path_walk()`` function, if
+you ignore all the places that only run when "``LOOKUP_RCU``"
(indicating the use of RCU-walk) is set.
-[Meet the Lockers]: https://lwn.net/Articles/453685/
+.. _Meet the Lockers: https://lwn.net/Articles/453685/
REF-walk is fairly heavy-handed with locks and reference counts. Not
as heavy-handed as in the old "big kernel lock" days, but certainly not
afraid of taking a lock when one is needed. It uses a variety of
different concurrency controls. A background understanding of the
various primitives is assumed, or can be gleaned from elsewhere such
-as in [Meet the Lockers].
+as in `Meet the Lockers`_.
The locking mechanisms used by REF-walk include:
-### dentry->d_lockref ###
+dentry->d_lockref
+~~~~~~~~~~~~~~~~~
This uses the lockref primitive to provide both a spinlock and a
reference count. The special-sauce of this primitive is that the
@@ -168,49 +155,51 @@ with a single atomic memory operation.
Holding a reference on a dentry ensures that the dentry won't suddenly
be freed and used for something else, so the values in various fields
-will behave as expected. It also protects the `->d_inode` reference
+will behave as expected. It also protects the ``->d_inode`` reference
to the inode to some extent.
The association between a dentry and its inode is fairly permanent.
For example, when a file is renamed, the dentry and inode move
together to the new location. When a file is created the dentry will
-initially be negative (i.e. `d_inode` is `NULL`), and will be assigned
+initially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned
to the new inode as part of the act of creation.
When a file is deleted, this can be reflected in the cache either by
-setting `d_inode` to `NULL`, or by removing it from the hash table
+setting ``d_inode`` to ``NULL``, or by removing it from the hash table
(described shortly) used to look up the name in the parent directory.
If the dentry is still in use the second option is used as it is
perfectly legal to keep using an open file after it has been deleted
and having the dentry around helps. If the dentry is not otherwise in
-use (i.e. if the refcount in `d_lockref` is one), only then will
-`d_inode` be set to `NULL`. Doing it this way is more efficient for a
+use (i.e. if the refcount in ``d_lockref`` is one), only then will
+``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a
very common case.
-So as long as a counted reference is held to a dentry, a non-`NULL` `->d_inode`
+So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
value will never be changed.
-### dentry->d_lock ###
+dentry->d_lock
+~~~~~~~~~~~~~~
-`d_lock` is a synonym for the spinlock that is part of `d_lockref` above.
+``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
For our purposes, holding this lock protects against the dentry being
-renamed or unlinked. In particular, its parent (`d_parent`), and its
-name (`d_name`) cannot be changed, and it cannot be removed from the
+renamed or unlinked. In particular, its parent (``d_parent``), and its
+name (``d_name``) cannot be changed, and it cannot be removed from the
dentry hash table.
-When looking for a name in a directory, REF-walk takes `d_lock` on
+When looking for a name in a directory, REF-walk takes ``d_lock`` on
each candidate dentry that it finds in the hash table and then checks
that the parent and name are correct. So it doesn't lock the parent
while searching in the cache; it only locks children.
-When looking for the parent for a given name (to handle "`..`"),
-REF-walk can take `d_lock` to get a stable reference to `d_parent`,
+When looking for the parent for a given name (to handle "``..``"),
+REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
but it first tries a more lightweight approach. As seen in
-`dget_parent()`, if a reference can be claimed on the parent, and if
-subsequently `d_parent` can be seen to have not changed, then there is
+``dget_parent()``, if a reference can be claimed on the parent, and if
+subsequently ``d_parent`` can be seen to have not changed, then there is
no need to actually take the lock on the child.
-### rename_lock ###
+rename_lock
+~~~~~~~~~~~
Looking up a given name in a given directory involves computing a hash
from the two values (the name and the dentry of the directory),
@@ -224,71 +213,117 @@ happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (`d_lookup()`) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does _not_ try to prevent this
from happening, but only to detect when it happens.
-`rename_lock` is a seqlock that is updated whenever any dentry is
-renamed. If `d_lookup` finds that a rename happened while it
+``rename_lock`` is a seqlock that is updated whenever any dentry is
+renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
-### inode->i_mutex ###
+inode->i_rwsem
+~~~~~~~~~~~~~~
-`i_mutex` is a mutex that serializes all changes to a particular
-directory. This ensures that, for example, an `unlink()` and a `rename()`
+``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
+directory. This ensures that, for example, an ``unlink()`` and a ``rename()``
cannot both happen at the same time. It also keeps the directory
stable while the filesystem is asked to look up a name that is not
-currently in the dcache.
+currently in the dcache or, optionally, when the list of entries in a
+directory is being retrieved with ``readdir()``.
-This has a complementary role to that of `d_lock`: `i_mutex` on a
-directory protects all of the names in that directory, while `d_lock`
+This has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
+directory protects all of the names in that directory, while ``d_lock``
on a name protects just one name in a directory. Most changes to the
-dcache hold `i_mutex` on the relevant directory inode and briefly take
-`d_lock` on one or more the dentries while the change happens. One
+dcache hold ``i_rwsem`` on the relevant directory inode and briefly take
+``d_lock`` on one or more the dentries while the change happens. One
exception is when idle dentries are removed from the dcache due to
-memory pressure. This uses `d_lock`, but `i_mutex` plays no role.
+memory pressure. This uses ``d_lock``, but ``i_rwsem`` plays no role.
-The mutex affects pathname lookup in two distinct ways. Firstly it
-serializes lookup of a name in a directory. `walk_component()` uses
-`lookup_fast()` first which, in turn, checks to see if the name is in the cache,
-using only `d_lock` locking. If the name isn't found, then `walk_component()`
-falls back to `lookup_slow()` which takes `i_mutex`, checks again that
+The semaphore affects pathname lookup in two distinct ways. Firstly it
+prevents changes during lookup of a name in a directory. ``walk_component()`` uses
+``lookup_fast()`` first which, in turn, checks to see if the name is in the cache,
+using only ``d_lock`` locking. If the name isn't found, then ``walk_component()``
+falls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
the name isn't in the cache, and then calls in to the filesystem to get a
definitive answer. A new dentry will be added to the cache regardless of
the result.
Secondly, when pathname lookup reaches the final component, it will
-sometimes need to take `i_mutex` before performing the last lookup so
+sometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
that the required exclusion can be achieved. How path lookup chooses
-to take, or not take, `i_mutex` is one of the
+to take, or not take, ``i_rwsem`` is one of the
issues addressed in a subsequent section.
-### mnt->mnt_count ###
-
-`mnt_count` is a per-CPU reference counter on "`mount`" structures.
+If two threads attempt to look up the same name at the same time - a
+name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
+not prevent them both adding new dentries with the same name. As this
+would result in confusion an extra level of interlocking is used,
+based around a secondary hash table (``in_lookup_hashtable``) and a
+per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
+
+To add a new dentry to the cache while only holding a shared lock on
+``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a
+dentry, stores the required name and parent in it, checks if there
+is already a matching dentry in the primary or secondary hash
+tables, and if not, stores the newly allocated dentry in the secondary
+hash table, with ``DCACHE_PAR_LOOKUP`` set.
+
+If a matching dentry was found in the primary hash table then that is
+returned and the caller can know that it lost a race with some other
+thread adding the entry. If no matching dentry is found in either
+cache, the newly allocated dentry is returned and the caller can
+detect this from the presence of ``DCACHE_PAR_LOOKUP``. In this case it
+knows that it has won any race and now is responsible for asking the
+filesystem to perform the lookup and find the matching inode. When
+the lookup is complete, it must call ``d_lookup_done()`` which clears
+the flag and does some other house keeping, including removing the
+dentry from the secondary hash table - it will normally have been
+added to the primary hash table already. Note that a ``struct
+waitqueue_head`` is passed to ``d_alloc_parallel()``, and
+``d_lookup_done()`` must be called while this ``waitqueue_head`` is still
+in scope.
+
+If a matching dentry is found in the secondary hash table,
+``d_alloc_parallel()`` has a little more work to do. It first waits for
+``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
+to the instance of ``d_alloc_parallel()`` that won the race and that
+will be woken by the call to ``d_lookup_done()``. It then checks to see
+if the dentry has now been added to the primary hash table. If it
+has, the dentry is returned and the caller just sees that it lost any
+race. If it hasn't been added to the primary hash table, the most
+likely explanation is that some other dentry was added instead using
+``d_splice_alias()``. In any case, ``d_alloc_parallel()`` repeats all the
+look ups from the start and will normally return something from the
+primary hash table.
+
+mnt->mnt_count
+~~~~~~~~~~~~~~
+
+``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
Per-CPU here means that incrementing the count is cheap as it only
uses CPU-local memory, but checking if the count is zero is expensive as
-it needs to check with every CPU. Taking a `mnt_count` reference
+it needs to check with every CPU. Taking a ``mnt_count`` reference
prevents the mount structure from disappearing as the result of regular
unmount operations, but does not prevent a "lazy" unmount. So holding
-`mnt_count` doesn't ensure that the mount remains in the namespace and,
+``mnt_count`` doesn't ensure that the mount remains in the namespace and,
in particular, doesn't stabilize the link to the mounted-on dentry. It
-does, however, ensure that the `mount` data structure remains coherent,
+does, however, ensure that the ``mount`` data structure remains coherent,
and it provides a reference to the root dentry of the mounted
-filesystem. So a reference through `->mnt_count` provides a stable
+filesystem. So a reference through ``->mnt_count`` provides a stable
reference to the mounted dentry, but not the mounted-on dentry.
-### mount_lock ###
+mount_lock
+~~~~~~~~~~
-`mount_lock` is a global seqlock, a bit like `rename_lock`. It can be used to
+``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to
check if any change has been made to any mount points.
While walking down the tree (away from the root) this lock is used when
crossing a mount point to check that the crossing was safe. That is,
the value in the seqlock is read, then the code finds the mount that
is mounted on the current directory, if there is one, and increments
-the `mnt_count`. Finally the value in `mount_lock` is checked against
+the ``mnt_count``. Finally the value in ``mount_lock`` is checked against
the old value. If there is no change, then the crossing was safe. If there
-was a change, the `mnt_count` is decremented and the whole process is
+was a change, the ``mnt_count`` is decremented and the whole process is
retried.
When walking up the tree (towards the root) by following a ".." link,
@@ -298,7 +333,8 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
-### RCU ###
+RCU
+~~~
Finally the global (but extremely lightweight) RCU read lock is held
from time to time to ensure certain data structures don't get freed
@@ -307,137 +343,141 @@ unexpectedly.
In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.
-Bringing it together with `struct nameidata`
+Bringing it together with ``struct nameidata``
--------------------------------------------
-[First edition Unix]: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
-in a `struct nameidata`, "namei" being the traditional name - dating
-all the way back to [First Edition Unix] - of the function that
-converts a "name" to an "inode". `struct nameidata` contains (among
+in a ``struct nameidata``, "namei" being the traditional name - dating
+all the way back to `First Edition Unix`_ - of the function that
+converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):
-### `struct path path` ###
+``struct path path``
+~~~~~~~~~~~~~~~~~~
-A `path` contains a `struct vfsmount` (which is
-embedded in a `struct mount`) and a `struct dentry`. Together these
+A ``path`` contains a ``struct vfsmount`` (which is
+embedded in a ``struct mount``) and a ``struct dentry``. Together these
record the current status of the walk. They start out referring to the
starting point (the current working directory, the root directory, or some other
directory identified by a file descriptor), and are updated on each
-step. A reference through `d_lockref` and `mnt_count` is always
+step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.
-### `struct qstr last` ###
+``struct qstr last``
+~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ `nul` terminated)
+This is a string together with a length (i.e. _not_ ``nul`` terminated)
that is the "next" component in the pathname.
-### `int last_type` ###
+``int last_type``
+~~~~~~~~~~~~~~~
-This is one of `LAST_NORM`, `LAST_ROOT`, `LAST_DOT`, `LAST_DOTDOT`, or
-`LAST_BIND`. The `last` field is only valid if the type is
-`LAST_NORM`. `LAST_BIND` is used when following a symlink and no
+This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
+``LAST_BIND``. The ``last`` field is only valid if the type is
+``LAST_NORM``. ``LAST_BIND`` is used when following a symlink and no
components of the symlink have been processed yet. Others should be
fairly self-explanatory.
-### `struct path root` ###
+``struct path root``
+~~~~~~~~~~~~~~~~~~
This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
only assigned the first time it is used, or when a non-standard root
-is requested. Keeping a reference in the `nameidata` ensures that
+is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
-with a `chroot()` system call.
+with a ``chroot()`` system call.
The root is needed when either of two conditions holds: (1) either the
-pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
-component is being handled, since "`..`" from the root must always stay
+pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
+component is being handled, since "``..``" from the root must always stay
at the root. The value used is usually the current root directory of
the calling process. An alternate root can be provided as when
-`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
-`mount_subtree()`. In each case a pathname is being looked up in a very
+``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
+``mount_subtree()``. In each case a pathname is being looked up in a very
specific part of the filesystem, and the lookup must not be allowed to
-escape that subtree. It works a bit like a local `chroot()`.
+escape that subtree. It works a bit like a local ``chroot()``.
Ignoring the handling of symbolic links, we can now describe the
-"`link_path_walk()`" function, which handles the lookup of everything
+"``link_path_walk()``" function, which handles the lookup of everything
except the final component as:
-> Given a path (`name`) and a nameidata structure (`nd`), check that the
-> current directory has execute permission and then advance `name`
-> over one component while updating `last_type` and `last`. If that
-> was the final component, then return, otherwise call
-> `walk_component()` and repeat from the top.
+ Given a path (``name``) and a nameidata structure (``nd``), check that the
+ current directory has execute permission and then advance ``name``
+ over one component while updating ``last_type`` and ``last``. If that
+ was the final component, then return, otherwise call
+ ``walk_component()`` and repeat from the top.
-`walk_component()` is even easier. If the component is `LAST_DOTS`,
-it calls `handle_dots()` which does the necessary locking as already
-described. If it finds a `LAST_NORM` component it first calls
-"`lookup_fast()`" which only looks in the dcache, but will ask the
+``walk_component()`` is even easier. If the component is ``LAST_DOTS``,
+it calls ``handle_dots()`` which does the necessary locking as already
+described. If it finds a ``LAST_NORM`` component it first calls
+"``lookup_fast()``" which only looks in the dcache, but will ask the
filesystem to revalidate the result if it is that sort of filesystem.
-If that doesn't get a good result, it calls "`lookup_slow()`" which
-takes the `i_mutex`, rechecks the cache, and then asks the filesystem
+If that doesn't get a good result, it calls "``lookup_slow()``" which
+takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call
-`follow_managed()` (as described below) to handle any mount points.
+``follow_managed()`` (as described below) to handle any mount points.
-In the absence of symbolic links, `walk_component()` creates a new
-`struct path` containing a counted reference to the new dentry and a
-reference to the new `vfsmount` which is only counted if it is
-different from the previous `vfsmount`. It then calls
-`path_to_nameidata()` to install the new `struct path` in the
-`struct nameidata` and drop the unneeded references.
+In the absence of symbolic links, ``walk_component()`` creates a new
+``struct path`` containing a counted reference to the new dentry and a
+reference to the new ``vfsmount`` which is only counted if it is
+different from the previous ``vfsmount``. It then calls
+``path_to_nameidata()`` to install the new ``struct path`` in the
+``struct nameidata`` and drop the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
seem obvious, but is worth pointing out so that we will recognize its
analogue in the "RCU-walk" version.
-Handling the final component.
------------------------------
+Handling the final component
+----------------------------
-`link_path_walk()` only walks as far as setting `nd->last` and
-`nd->last_type` to refer to the final component of the path. It does
-not call `walk_component()` that last time. Handling that final
+``link_path_walk()`` only walks as far as setting ``nd->last`` and
+``nd->last_type`` to refer to the final component of the path. It does
+not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
-`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
-`path_openat()` each of which handles the differing requirements of
+``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
+``path_openat()`` each of which handles the differing requirements of
different system calls.
-`path_parentat()` is clearly the simplest - it just wraps a little bit
-of housekeeping around `link_path_walk()` and returns the parent
+``path_parentat()`` is clearly the simplest - it just wraps a little bit
+of housekeeping around ``link_path_walk()`` and returns the parent
directory and final component to the caller. The caller will be either
-aiming to create a name (via `filename_create()`) or remove or rename
-a name (in which case `user_path_parent()` is used). They will use
-`i_mutex` to exclude other changes while they validate and then
+aiming to create a name (via ``filename_create()``) or remove or rename
+a name (in which case ``user_path_parent()`` is used). They will use
+``i_rwsem`` to exclude other changes while they validate and then
perform their operation.
-`path_lookupat()` is nearly as simple - it is used when an existing
-object is wanted such as by `stat()` or `chmod()`. It essentially just
-calls `walk_component()` on the final component through a call to
-`lookup_last()`. `path_lookupat()` returns just the final dentry.
+``path_lookupat()`` is nearly as simple - it is used when an existing
+object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
+calls ``walk_component()`` on the final component through a call to
+``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
-`path_mountpoint()` handles the special case of unmounting which must
+``path_mountpoint()`` handles the special case of unmounting which must
not try to revalidate the mounted filesystem. It effectively
-contains, through a call to `mountpoint_last()`, an alternate
-implementation of `lookup_slow()` which skips that step. This is
+contains, through a call to ``mountpoint_last()``, an alternate
+implementation of ``lookup_slow()`` which skips that step. This is
important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.
-Finally `path_openat()` is used for the `open()` system call; it
-contains, in support functions starting with "`do_last()`", all the
+Finally ``path_openat()`` is used for the ``open()`` system call; it
+contains, in support functions starting with "``do_last()``", all the
complexity needed to handle the different subtleties of O_CREAT (with
-or without O_EXCL), final "`/`" characters, and trailing symbolic
+or without O_EXCL), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
-focuses on those symbolic links. "`do_last()`" will sometimes, but
-not always, take `i_mutex`, depending on what it finds.
+focuses on those symbolic links. "``do_last()``" will sometimes, but
+not always, take ``i_rwsem``, depending on what it finds.
Each of these, or the functions which call them, need to be alert to
-the possibility that the final component is not `LAST_NORM`. If the
+the possibility that the final component is not ``LAST_NORM``. If the
goal of the lookup is to create something, then any value for
-`last_type` other than `LAST_NORM` will result in an error. For
-example if `path_parentat()` reports `LAST_DOTDOT`, then the caller
+``last_type`` other than ``LAST_NORM`` will result in an error. For
+example if ``path_parentat()`` reports ``LAST_DOTDOT``, then the caller
won't try to create that name. They also check for trailing slashes
-by testing `last.name[last.len]`. If there is any character beyond
+by testing ``last.name[last.len]``. If there is any character beyond
the final component, it must be a trailing slash.
Revalidation and automounts
@@ -448,12 +488,12 @@ process not yet covered. One is the handling of stale cache entries
and the other is automounts.
On filesystems that require it, the lookup routines will call the
-`->d_revalidate()` dentry method to ensure that the cached information
+``->d_revalidate()`` dentry method to ensure that the cached information
is current. This will often confirm validity or update a few details
from a server. In some cases it may find that there has been change
further up the path and that something that was thought to be valid
previously isn't really. When this happens the lookup of the whole
-path is aborted and retried with the "`LOOKUP_REVAL`" flag set. This
+path is aborted and retried with the "``LOOKUP_REVAL``" flag set. This
forces revalidation to be more thorough. We will see more details of
this retry process in the next article.
@@ -465,52 +505,55 @@ tree, but a few notes specifically related to path lookup are in order
here.
The Linux VFS has a concept of "managed" dentries which is reflected
-in function names such as "`follow_managed()`". There are three
+in function names such as "``follow_managed()``". There are three
potentially interesting things about these dentries corresponding
-to three different flags that might be set in `dentry->d_flags`:
+to three different flags that might be set in ``dentry->d_flags``:
-### `DCACHE_MANAGE_TRANSIT` ###
+``DCACHE_MANAGE_TRANSIT``
+~~~~~~~~~~~~~~~~~~~~~~~
If this flag has been set, then the filesystem has requested that the
-`d_manage()` dentry operation be called before handling any possible
+``d_manage()`` dentry operation be called before handling any possible
mount point. This can perform two particular services:
It can block to avoid races. If an automount point is being
-unmounted, the `d_manage()` function will usually wait for that
+unmounted, the ``d_manage()`` function will usually wait for that
process to complete before letting the new lookup proceed and possibly
trigger a new automount.
It can selectively allow only some processes to transit through a
mount point. When a server process is managing automounts, it may
need to access a directory without triggering normal automount
-processing. That server process can identify itself to the `autofs`
+processing. That server process can identify itself to the ``autofs``
filesystem, which will then give it a special pass through
-`d_manage()` by returning `-EISDIR`.
+``d_manage()`` by returning ``-EISDIR``.
-### `DCACHE_MOUNTED` ###
+``DCACHE_MOUNTED``
+~~~~~~~~~~~~~~~~
This flag is set on every dentry that is mounted on. As Linux
supports multiple filesystem namespaces, it is possible that the
dentry may not be mounted on in *this* namespace, just in some
other. So this flag is seen as a hint, not a promise.
-If this flag is set, and `d_manage()` didn't return `-EISDIR`,
-`lookup_mnt()` is called to examine the mount hash table (honoring the
-`mount_lock` described earlier) and possibly return a new `vfsmount`
-and a new `dentry` (both with counted references).
+If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
+``lookup_mnt()`` is called to examine the mount hash table (honoring the
+``mount_lock`` described earlier) and possibly return a new ``vfsmount``
+and a new ``dentry`` (both with counted references).
-### `DCACHE_NEED_AUTOMOUNT` ###
+``DCACHE_NEED_AUTOMOUNT``
+~~~~~~~~~~~~~~~~~~~~~~~
-If `d_manage()` allowed us to get this far, and `lookup_mnt()` didn't
-find a mount point, then this flag causes the `d_automount()` dentry
+If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
+find a mount point, then this flag causes the ``d_automount()`` dentry
operation to be called.
-The `d_automount()` operation can be arbitrarily complex and may
+The ``d_automount()`` operation can be arbitrarily complex and may
communicate with server processes etc. but it should ultimately either
report that there was an error, that there was nothing to mount, or
-should provide an updated `struct path` with new `dentry` and `vfsmount`.
+should provide an updated ``struct path`` with new ``dentry`` and ``vfsmount``.
-In the latter case, `finish_automount()` will be called to safely
+In the latter case, ``finish_automount()`` will be called to safely
install the new mount point into the mount table.
There is no new locking of import here and it is important that no
@@ -567,7 +610,7 @@ isn't in the cache, then it tries to stop gracefully and switch to
REF-walk.
This stopping requires getting a counted reference on the current
-`vfsmount` and `dentry`, and ensuring that these are still valid -
+``vfsmount`` and ``dentry``, and ensuring that these are still valid -
that a path walk with REF-walk would have found the same entries.
This is an invariant that RCU-walk must guarantee. It can only make
decisions, such as selecting the next step, that are decisions which
@@ -578,21 +621,21 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
This pattern of "try RCU-walk, if that fails try REF-walk" can be
-clearly seen in functions like `filename_lookup()`,
-`filename_parentat()`, `filename_mountpoint()`,
-`do_filp_open()`, and `do_file_open_root()`. These five
-correspond roughly to the four `path_`* functions we met earlier,
-each of which calls `link_path_walk()`. The `path_*` functions are
+clearly seen in functions like ``filename_lookup()``,
+``filename_parentat()``, ``filename_mountpoint()``,
+``do_filp_open()``, and ``do_file_open_root()``. These five
+correspond roughly to the four ``path_``* functions we met earlier,
+each of which calls ``link_path_walk()``. The ``path_*`` functions are
called using different mode flags until a mode is found which works.
-They are first called with `LOOKUP_RCU` set to request "RCU-walk". If
-that fails with the error `ECHILD` they are called again with no
+They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
+that fails with the error ``ECHILD`` they are called again with no
special flag to request "REF-walk". If either of those report the
-error `ESTALE` a final attempt is made with `LOOKUP_REVAL` set (and no
-`LOOKUP_RCU`) to ensure that entries found in the cache are forcibly
+error ``ESTALE`` a final attempt is made with ``LOOKUP_REVAL`` set (and no
+``LOOKUP_RCU``) to ensure that entries found in the cache are forcibly
revalidated - normally entries are only revalidated if the filesystem
determines that they are too old to trust.
-The `LOOKUP_RCU` attempt may drop that flag internally and switch to
+The ``LOOKUP_RCU`` attempt may drop that flag internally and switch to
REF-walk, but will never then try to switch back to RCU-walk. Places
that trip up RCU-walk are much more likely to be near the leaves and
so it is very unlikely that there will be much, if any, benefit from
@@ -602,7 +645,7 @@ RCU and seqlocks: fast and light
--------------------------------
RCU is, unsurprisingly, critical to RCU-walk mode. The
-`rcu_read_lock()` is held for the entire time that RCU-walk is walking
+``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
down a path. The particular guarantee it provides is that the key
data structures - dentries, inodes, super_blocks, and mounts - will
not be freed while the lock is held. They might be unlinked or
@@ -614,7 +657,7 @@ seqlocks.
As we saw above, REF-walk holds a counted reference to the current
dentry and the current vfsmount, and does not release those references
before taking references to the "next" dentry or vfsmount. It also
-sometimes takes the `d_lock` spinlock. These references and locks are
+sometimes takes the ``d_lock`` spinlock. These references and locks are
taken to prevent certain changes from happening. RCU-walk must not
take those references or locks and so cannot prevent such changes.
Instead, it checks to see if a change has been made, and aborts or
@@ -624,123 +667,126 @@ To preserve the invariant mentioned above (that RCU-walk may only make
decisions that REF-walk could have made), it must make the checks at
or near the same places that REF-walk holds the references. So, when
REF-walk increments a reference count or takes a spinlock, RCU-walk
-samples the status of a seqlock using `read_seqcount_begin()` or a
+samples the status of a seqlock using ``read_seqcount_begin()`` or a
similar function. When REF-walk decrements the count or drops the
lock, RCU-walk checks if the sampled status is still valid using
-`read_seqcount_retry()` or similar.
+``read_seqcount_retry()`` or similar.
However, there is a little bit more to seqlocks than that. If
RCU-walk accesses two different fields in a seqlock-protected
structure, or accesses the same field twice, there is no a priori
guarantee of any consistency between those accesses. When consistency
is needed - which it usually is - RCU-walk must take a copy and then
-use `read_seqcount_retry()` to validate that copy.
+use ``read_seqcount_retry()`` to validate that copy.
-`read_seqcount_retry()` not only checks the sequence number, but also
+``read_seqcount_retry()`` not only checks the sequence number, but also
imposes a memory barrier so that no memory-read instruction from
*before* the call can be delayed until *after* the call, either by the
CPU or by the compiler. A simple example of this can be seen in
-`slow_dentry_cmp()` which, for filesystems which do not use simple
+``slow_dentry_cmp()`` which, for filesystems which do not use simple
byte-wise name equality, calls into the filesystem to compare a name
against a dentry. The length and name pointer are copied into local
-variables, then `read_seqcount_retry()` is called to confirm the two
-are consistent, and only then is `->d_compare()` called. When
-standard filename comparison is used, `dentry_cmp()` is called
-instead. Notably it does _not_ use `read_seqcount_retry()`, but
+variables, then ``read_seqcount_retry()`` is called to confirm the two
+are consistent, and only then is ``->d_compare()`` called. When
+standard filename comparison is used, ``dentry_cmp()`` is called
+instead. Notably it does _not_ use ``read_seqcount_retry()``, but
instead has a large comment explaining why the consistency guarantee
-isn't necessary. A subsequent `read_seqcount_retry()` will be
+isn't necessary. A subsequent ``read_seqcount_retry()`` will be
sufficient to catch any problem that could occur at this point.
With that little refresher on seqlocks out of the way we can look at
the bigger picture of how RCU-walk uses seqlocks.
-### `mount_lock` and `nd->m_seq` ###
+``mount_lock`` and ``nd->m_seq``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We already met the `mount_lock` seqlock when REF-walk used it to
+We already met the ``mount_lock`` seqlock when REF-walk used it to
ensure that crossing a mount point is performed safely. RCU-walk uses
it for that too, but for quite a bit more.
-Instead of taking a counted reference to each `vfsmount` as it
-descends the tree, RCU-walk samples the state of `mount_lock` at the
+Instead of taking a counted reference to each ``vfsmount`` as it
+descends the tree, RCU-walk samples the state of ``mount_lock`` at the
start of the walk and stores this initial sequence number in the
-`struct nameidata` in the `m_seq` field. This one lock and one
-sequence number are used to validate all accesses to all `vfsmounts`,
+``struct nameidata`` in the ``m_seq`` field. This one lock and one
+sequence number are used to validate all accesses to all ``vfsmounts``,
and all mount point crossings. As changes to the mount table are
relatively rare, it is reasonable to fall back on REF-walk any time
that any "mount" or "unmount" happens.
-`m_seq` is checked (using `read_seqretry()`) at the end of an RCU-walk
+``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
sequence, whether switching to REF-walk for the rest of the path or
when the end of the path is reached. It is also checked when stepping
-down over a mount point (in `__follow_mount_rcu()`) or up (in
-`follow_dotdot_rcu()`). If it is ever found to have changed, the
+down over a mount point (in ``__follow_mount_rcu()``) or up (in
+``follow_dotdot_rcu()``). If it is ever found to have changed, the
whole RCU-walk sequence is aborted and the path is processed again by
REF-walk.
-If RCU-walk finds that `mount_lock` hasn't changed then it can be sure
+If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
that, had REF-walk taken counted references on each vfsmount, the
results would have been the same. This ensures the invariant holds,
at least for vfsmount structures.
-### `dentry->d_seq` and `nd->seq`. ###
+``dentry->d_seq`` and ``nd->seq``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-In place of taking a count or lock on `d_reflock`, RCU-walk samples
-the per-dentry `d_seq` seqlock, and stores the sequence number in the
-`seq` field of the nameidata structure, so `nd->seq` should always be
-the current sequence number of `nd->dentry`. This number needs to be
+In place of taking a count or lock on ``d_reflock``, RCU-walk samples
+the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
+``seq`` field of the nameidata structure, so ``nd->seq`` should always be
+the current sequence number of ``nd->dentry``. This number needs to be
revalidated after copying, and before using, the name, parent, or
inode of the dentry.
The handling of the name we have already looked at, and the parent is
-only accessed in `follow_dotdot_rcu()` which fairly trivially follows
+only accessed in ``follow_dotdot_rcu()`` which fairly trivially follows
the required pattern, though it does so for three different cases.
-When not at a mount point, `d_parent` is followed and its `d_seq` is
+When not at a mount point, ``d_parent`` is followed and its ``d_seq`` is
collected. When we are at a mount point, we instead follow the
-`mnt->mnt_mountpoint` link to get a new dentry and collect its
-`d_seq`. Then, after finally finding a `d_parent` to follow, we must
+``mnt->mnt_mountpoint`` link to get a new dentry and collect its
+``d_seq``. Then, after finally finding a ``d_parent`` to follow, we must
check if we have landed on a mount point and, if so, must find that
-mount point and follow the `mnt->mnt_root` link. This would imply a
+mount point and follow the ``mnt->mnt_root`` link. This would imply a
somewhat unusual, but certainly possible, circumstance where the
starting point of the path lookup was in part of the filesystem that
was mounted on, and so not visible from the root.
-The inode pointer, stored in `->d_inode`, is a little more
+The inode pointer, stored in ``->d_inode``, is a little more
interesting. The inode will always need to be accessed at least
twice, once to determine if it is NULL and once to verify access
permissions. Symlink handling requires a validated inode pointer too.
Rather than revalidating on each access, a copy is made on the first
-access and it is stored in the `inode` field of `nameidata` from where
+access and it is stored in the ``inode`` field of ``nameidata`` from where
it can be safely accessed without further validation.
-`lookup_fast()` is the only lookup routine that is used in RCU-mode,
-`lookup_slow()` being too slow and requiring locks. It is in
-`lookup_fast()` that we find the important "hand over hand" tracking
+``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
+``lookup_slow()`` being too slow and requiring locks. It is in
+``lookup_fast()`` that we find the important "hand over hand" tracking
of the current dentry.
-The current `dentry` and current `seq` number are passed to
-`__d_lookup_rcu()` which, on success, returns a new `dentry` and a
-new `seq` number. `lookup_fast()` then copies the inode pointer and
-revalidates the new `seq` number. It then validates the old `dentry`
-with the old `seq` number one last time and only then continues. This
-process of getting the `seq` number of the new dentry and then
-checking the `seq` number of the old exactly mirrors the process of
+The current ``dentry`` and current ``seq`` number are passed to
+``__d_lookup_rcu()`` which, on success, returns a new ``dentry`` and a
+new ``seq`` number. ``lookup_fast()`` then copies the inode pointer and
+revalidates the new ``seq`` number. It then validates the old ``dentry``
+with the old ``seq`` number one last time and only then continues. This
+process of getting the ``seq`` number of the new dentry and then
+checking the ``seq`` number of the old exactly mirrors the process of
getting a counted reference to the new dentry before dropping that for
the old dentry which we saw in REF-walk.
-### No `inode->i_mutex` or even `rename_lock` ###
+No ``inode->i_rwsem`` or even ``rename_lock``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-A mutex is a fairly heavyweight lock that can only be taken when it is
-permissible to sleep. As `rcu_read_lock()` forbids sleeping,
-`inode->i_mutex` plays no role in RCU-walk. If some other thread does
-take `i_mutex` and modifies the directory in a way that RCU-walk needs
+A semaphore is a fairly heavyweight lock that can only be taken when it is
+permissible to sleep. As ``rcu_read_lock()`` forbids sleeping,
+``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
+take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
to notice, the result will be either that RCU-walk fails to find the
dentry that it is looking for, or it will find a dentry which
-`read_seqretry()` won't validate. In either case it will drop down to
+``read_seqretry()`` won't validate. In either case it will drop down to
REF-walk mode which can take whatever locks are needed.
-Though `rename_lock` could be used by RCU-walk as it doesn't require
-any sleeping, RCU-walk doesn't bother. REF-walk uses `rename_lock` to
+Though ``rename_lock`` could be used by RCU-walk as it doesn't require
+any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
protect against the possibility of hash chains in the dcache changing
while they are being searched. This can result in failing to find
something that actually is there. When RCU-walk fails to find
@@ -749,57 +795,57 @@ already drops down to REF-walk and tries again with appropriate
locking. This neatly handles all cases, so adding extra checks on
rename_lock would bring no significant value.
-`unlazy walk()` and `complete_walk()`
+``unlazy walk()`` and ``complete_walk()``
-------------------------------------
That "dropping down to REF-walk" typically involves a call to
-`unlazy_walk()`, so named because "RCU-walk" is also sometimes
-referred to as "lazy walk". `unlazy_walk()` is called when
+``unlazy_walk()``, so named because "RCU-walk" is also sometimes
+referred to as "lazy walk". ``unlazy_walk()`` is called when
following the path down to the current vfsmount/dentry pair seems to
have proceeded successfully, but the next step is problematic. This
can happen if the next name cannot be found in the dcache, if
permission checking or name revalidation couldn't be achieved while
-the `rcu_read_lock()` is held (which forbids sleeping), if an
+the ``rcu_read_lock()`` is held (which forbids sleeping), if an
automount point is found, or in a couple of cases involving symlinks.
-It is also called from `complete_walk()` when the lookup has reached
+It is also called from ``complete_walk()`` when the lookup has reached
the final component, or the very end of the path, depending on which
particular flavor of lookup is used.
Other reasons for dropping out of RCU-walk that do not trigger a call
-to `unlazy_walk()` are when some inconsistency is found that cannot be
-handled immediately, such as `mount_lock` or one of the `d_seq`
+to ``unlazy_walk()`` are when some inconsistency is found that cannot be
+handled immediately, such as ``mount_lock`` or one of the ``d_seq``
seqlocks reporting a change. In these cases the relevant function
-will return `-ECHILD` which will percolate up until it triggers a new
+will return ``-ECHILD`` which will percolate up until it triggers a new
attempt from the top using REF-walk.
-For those cases where `unlazy_walk()` is an option, it essentially
+For those cases where ``unlazy_walk()`` is an option, it essentially
takes a reference on each of the pointers that it holds (vfsmount,
dentry, and possibly some symbolic links) and then verifies that the
relevant seqlocks have not been changed. If there have been changes,
-it, too, aborts with `-ECHILD`, otherwise the transition to REF-walk
+it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
has been a success and the lookup process continues.
Taking a reference on those pointers is not quite as simple as just
incrementing a counter. That works to take a second reference if you
already have one (often indirectly through another object), but it
isn't sufficient if you don't actually have a counted reference at
-all. For `dentry->d_lockref`, it is safe to increment the reference
+all. For ``dentry->d_lockref``, it is safe to increment the reference
counter to get a reference unless it has been explicitly marked as
-"dead" which involves setting the counter to `-128`.
-`lockref_get_not_dead()` achieves this.
+"dead" which involves setting the counter to ``-128``.
+``lockref_get_not_dead()`` achieves this.
-For `mnt->mnt_count` it is safe to take a reference as long as
-`mount_lock` is then used to validate the reference. If that
+For ``mnt->mnt_count`` it is safe to take a reference as long as
+``mount_lock`` is then used to validate the reference. If that
validation fails, it may *not* be safe to just drop that reference in
-the standard way of calling `mnt_put()` - an unmount may have
-progressed too far. So the code in `legitimize_mnt()`, when it
+the standard way of calling ``mnt_put()`` - an unmount may have
+progressed too far. So the code in ``legitimize_mnt()``, when it
finds that the reference it got might not be safe, checks the
-`MNT_SYNC_UMOUNT` flag to determine if a simple `mnt_put()` is
+``MNT_SYNC_UMOUNT`` flag to determine if a simple ``mnt_put()`` is
correct, or if it should just decrement the count and pretend none of
this ever happened.
Taking care in filesystems
----------------------------
+--------------------------
RCU-walk depends almost entirely on cached information and often will
not call into the filesystem at all. However there are two places,
@@ -809,26 +855,26 @@ careful.
If the filesystem has non-standard permission-checking requirements -
such as a networked filesystem which may need to check with the server
-- the `i_op->permission` interface might be called during RCU-walk.
-In this case an extra "`MAY_NOT_BLOCK`" flag is passed so that it
-knows not to sleep, but to return `-ECHILD` if it cannot complete
-promptly. `i_op->permission` is given the inode pointer, not the
+- the ``i_op->permission`` interface might be called during RCU-walk.
+In this case an extra "``MAY_NOT_BLOCK``" flag is passed so that it
+knows not to sleep, but to return ``-ECHILD`` if it cannot complete
+promptly. ``i_op->permission`` is given the inode pointer, not the
dentry, so it doesn't need to worry about further consistency checks.
However if it accesses any other filesystem data structures, it must
-ensure they are safe to be accessed with only the `rcu_read_lock()`
-held. This typically means they must be freed using `kfree_rcu()` or
+ensure they are safe to be accessed with only the ``rcu_read_lock()``
+held. This typically means they must be freed using ``kfree_rcu()`` or
similar.
-[`READ_ONCE()`]: https://lwn.net/Articles/624126/
+.. _READ_ONCE: https://lwn.net/Articles/624126/
If the filesystem may need to revalidate dcache entries, then
-`d_op->d_revalidate` may be called in RCU-walk too. This interface
-*is* passed the dentry but does not have access to the `inode` or the
-`seq` number from the `nameidata`, so it needs to be extra careful
+``d_op->d_revalidate`` may be called in RCU-walk too. This interface
+*is* passed the dentry but does not have access to the ``inode`` or the
+``seq`` number from the ``nameidata``, so it needs to be extra careful
when accessing fields in the dentry. This "extra care" typically
-involves using [`READ_ONCE()`] to access fields, and verifying the
+involves using `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
result is not NULL before using it. This pattern can be seen in
-`nfs_lookup_revalidate()`.
+``nfs_lookup_revalidate()``.
A pair of patterns
------------------
@@ -839,14 +885,14 @@ being aware of.
The first is "try quickly and check, if that fails try slowly". We
can see that in the high-level approach of first trying RCU-walk and
-then trying REF-walk, and in places where `unlazy_walk()` is used to
+then trying REF-walk, and in places where ``unlazy_walk()`` is used to
switch to REF-walk for the rest of the path. We also saw it earlier
-in `dget_parent()` when following a "`..`" link. It tries a quick way
+in ``dget_parent()`` when following a "``..``" link. It tries a quick way
to get a reference, then falls back to taking locks if needed.
The second pattern is "try quickly and check, if that fails try
-again - repeatedly". This is seen with the use of `rename_lock` and
-`mount_lock` in REF-walk. RCU-walk doesn't make use of this pattern -
+again - repeatedly". This is seen with the use of ``rename_lock`` and
+``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
if anything goes wrong it is much safer to just abort and try a more
sedate approach.
@@ -882,8 +928,8 @@ Conceptually, symbolic links could be handled by editing the path. If
a component name refers to a symbolic link, then that component is
replaced by the body of the link and, if that body starts with a '/',
then all preceding parts of the path are discarded. This is what the
-"`readlink -f`" command does, though it also edits out "`.`" and
-"`..`" components.
+"``readlink -f``" command does, though it also edits out "``.``" and
+"``..``" components.
Directly editing the path string is not really necessary when looking
up a path, and discarding early components is pointless as they aren't
@@ -899,19 +945,19 @@ There are two reasons for placing limits on how many symlinks can
occur in a single path lookup. The most obvious is to avoid loops.
If a symlink referred to itself either directly or through
intermediaries, then following the symlink can never complete
-successfully - the error `ELOOP` must be returned. Loops can be
+successfully - the error ``ELOOP`` must be returned. Loops can be
detected without imposing limits, but limits are the simplest solution
and, given the second reason for restriction, quite sufficient.
-[outlined recently]: http://thread.gmane.org/gmane.linux.kernel/1934390/focus=1934550
+.. _outlined recently: http://thread.gmane.org/gmane.linux.kernel/1934390/focus=1934550
-The second reason was [outlined recently] by Linus:
+The second reason was `outlined recently`_ by Linus:
-> Because it's a latency and DoS issue too. We need to react well to
-> true loops, but also to "very deep" non-loops. It's not about memory
-> use, it's about users triggering unreasonable CPU resources.
+ Because it's a latency and DoS issue too. We need to react well to
+ true loops, but also to "very deep" non-loops. It's not about memory
+ use, it's about users triggering unreasonable CPU resources.
-Linux imposes a limit on the length of any pathname: `PATH_MAX`, which
+Linux imposes a limit on the length of any pathname: ``PATH_MAX``, which
is 4096. There are a number of reasons for this limit; not letting the
kernel spend too much time on just one path is one of them. With
symbolic links you can effectively generate much longer paths so some
@@ -921,7 +967,7 @@ further limit of eight on the maximum depth of recursion, but that was
raised to 40 when a separate stack was implemented, so there is now
just the one limit.
-The `nameidata` structure that we met in an earlier article contains a
+The ``nameidata`` structure that we met in an earlier article contains a
small stack that can be used to store the remaining part of up to two
symlinks. In many cases this will be sufficient. If it isn't, a
separate stack is allocated with room for 40 symlinks. Pathname
@@ -941,13 +987,13 @@ to external storage. It is particularly important for RCU-walk to be
able to find and temporarily hold onto these cached entries, so that
it doesn't need to drop down into REF-walk.
-[object-oriented design pattern]: https://lwn.net/Articles/446317/
+.. _object-oriented design pattern: https://lwn.net/Articles/446317/
While each filesystem is free to make its own choice, symlinks are
typically stored in one of two places. Short symlinks are often
-stored directly in the inode. When a filesystem allocates a `struct
-inode` it typically allocates extra space to store private data (a
-common [object-oriented design pattern] in the kernel). This will
+stored directly in the inode. When a filesystem allocates a ``struct
+inode`` it typically allocates extra space to store private data (a
+common `object-oriented design pattern`_ in the kernel). This will
sometimes include space for a symlink. The other common location is
in the page cache, which normally stores the content of files. The
pathname in a symlink can be seen as the content of that symlink and
@@ -962,13 +1008,13 @@ the inode which, itself, is protected by RCU or by a counted reference
on the dentry. This means that the mechanisms that pathname lookup
uses to access the dcache and icache (inode cache) safely are quite
sufficient for accessing some cached symlinks safely. In these cases,
-the `i_link` pointer in the inode is set to point to wherever the
+the ``i_link`` pointer in the inode is set to point to wherever the
symlink is stored and it can be accessed directly whenever needed.
When the symlink is stored in the page cache or elsewhere, the
situation is not so straightforward. A reference on a dentry or even
on an inode does not imply any reference on cached pages of that
-inode, and even an `rcu_read_lock()` is not sufficient to ensure that
+inode, and even an ``rcu_read_lock()`` is not sufficient to ensure that
a page will not disappear. So for these symlinks the pathname lookup
code needs to ask the filesystem to provide a stable reference and,
significantly, needs to release that reference when it is finished
@@ -978,48 +1024,48 @@ Taking a reference to a cache page is often possible even in RCU-walk
mode. It does require making changes to memory, which is best avoided,
but that isn't necessarily a big cost and it is better than dropping
out of RCU-walk mode completely. Even filesystems that allocate
-space to copy the symlink into can use `GFP_ATOMIC` to often successfully
+space to copy the symlink into can use ``GFP_ATOMIC`` to often successfully
allocate memory without the need to drop out of RCU-walk. If a
filesystem cannot successfully get a reference in RCU-walk mode, it
-must return `-ECHILD` and `unlazy_walk()` will be called to return to
+must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
REF-walk mode in which the filesystem is allowed to sleep.
-The place for all this to happen is the `i_op->follow_link()` inode
+The place for all this to happen is the ``i_op->follow_link()`` inode
method. In the present mainline code this is never actually called in
RCU-walk mode as the rewrite is not quite complete. It is likely that
-in a future release this method will be passed an `inode` pointer when
+in a future release this method will be passed an ``inode`` pointer when
called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
-validated pointer. Much like the `i_op->permission()` method we
-looked at previously, `->follow_link()` would need to be careful that
+validated pointer. Much like the ``i_op->permission()`` method we
+looked at previously, ``->follow_link()`` would need to be careful that
all the data structures it references are safe to be accessed while
holding no counted reference, only the RCU lock. Though getting a
-reference with `->follow_link()` is not yet done in RCU-walk mode, the
+reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
code is ready to release the reference when that does happen.
This need to drop the reference to a symlink adds significant
complexity. It requires a reference to the inode so that the
-`i_op->put_link()` inode operation can be called. In REF-walk, that
+``i_op->put_link()`` inode operation can be called. In REF-walk, that
reference is kept implicitly through a reference to the dentry, so
-keeping the `struct path` of the symlink is easiest. For RCU-walk,
+keeping the ``struct path`` of the symlink is easiest. For RCU-walk,
the pointer to the inode is kept separately. To allow switching from
RCU-walk back to REF-walk in the middle of processing nested symlinks
we also need the seq number for the dentry so we can confirm that
switching back was safe.
Finally, when providing a reference to a symlink, the filesystem also
-provides an opaque "cookie" that must be passed to `->put_link()` so that it
+provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
knows what to free. This might be the allocated memory area, or a
-pointer to the `struct page` in the page cache, or something else
+pointer to the ``struct page`` in the page cache, or something else
completely. Only the filesystem knows what it is.
In order for the reference to each symlink to be dropped when the walk completes,
whether in RCU-walk or REF-walk, the symlink stack needs to contain,
along with the path remnants:
-- the `struct path` to provide a reference to the inode in REF-walk
-- the `struct inode *` to provide a reference to the inode in RCU-walk
-- the `seq` to allow the path to be safely switched from RCU-walk to REF-walk
-- the `cookie` that tells `->put_path()` what to put.
+- the ``struct path`` to provide a reference to the inode in REF-walk
+- the ``struct inode *`` to provide a reference to the inode in RCU-walk
+- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
+- the ``cookie`` that tells ``->put_path()`` what to put.
This means that each entry in the symlink stack needs to hold five
pointers and an integer instead of just one pointer (the path
@@ -1028,28 +1074,28 @@ with 40 entries it adds up to 1600 bytes total, which is less than
half a page. So it might seem like a lot, but is by no means
excessive.
-Note that, in a given stack frame, the path remnant (`name`) is not
+Note that, in a given stack frame, the path remnant (``name``) is not
part of the symlink that the other fields refer to. It is the remnant
to be followed once that symlink has been fully parsed.
Following the symlink
---------------------
-The main loop in `link_path_walk()` iterates seamlessly over all
+The main loop in ``link_path_walk()`` iterates seamlessly over all
components in the path and all of the non-final symlinks. As symlinks
-are processed, the `name` pointer is adjusted to point to a new
+are processed, the ``name`` pointer is adjusted to point to a new
symlink, or is restored from the stack, so that much of the loop
-doesn't need to notice. Getting this `name` variable on and off the
+doesn't need to notice. Getting this ``name`` variable on and off the
stack is very straightforward; pushing and popping the references is
a little more complex.
-When a symlink is found, `walk_component()` returns the value `1`
-(`0` is returned for any other sort of success, and a negative number
-is, as usual, an error indicator). This causes `get_link()` to be
+When a symlink is found, ``walk_component()`` returns the value ``1``
+(``0`` is returned for any other sort of success, and a negative number
+is, as usual, an error indicator). This causes ``get_link()`` to be
called; it then gets the link from the filesystem. Providing that
-operation is successful, the old path `name` is placed on the stack,
-and the new value is used as the `name` for a while. When the end of
-the path is found (i.e. `*name` is `'\0'`) the old `name` is restored
+operation is successful, the old path ``name`` is placed on the stack,
+and the new value is used as the ``name`` for a while. When the end of
+the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
off the stack and path walking continues.
Pushing and popping the reference pointers (inode, cookie, etc.) is more
@@ -1060,113 +1106,114 @@ the symlink-just-found to avoid leaving empty path remnants that would
just get in the way.
It is most convenient to push the new symlink references onto the
-stack in `walk_component()` immediately when the symlink is found;
-`walk_component()` is also the last piece of code that needs to look at the
+stack in ``walk_component()`` immediately when the symlink is found;
+``walk_component()`` is also the last piece of code that needs to look at the
old symlink as it walks that last component. So it is quite
-convenient for `walk_component()` to release the old symlink and pop
+convenient for ``walk_component()`` to release the old symlink and pop
the references just before pushing the reference information for the
-new symlink. It is guided in this by two flags; `WALK_GET`, which
+new symlink. It is guided in this by two flags; ``WALK_GET``, which
gives it permission to follow a symlink if it finds one, and
-`WALK_PUT`, which tells it to release the current symlink after it has been
-followed. `WALK_PUT` is tested first, leading to a call to
-`put_link()`. `WALK_GET` is tested subsequently (by
-`should_follow_link()`) leading to a call to `pick_link()` which sets
+``WALK_PUT``, which tells it to release the current symlink after it has been
+followed. ``WALK_PUT`` is tested first, leading to a call to
+``put_link()``. ``WALK_GET`` is tested subsequently (by
+``should_follow_link()``) leading to a call to ``pick_link()`` which sets
up the stack frame.
-### Symlinks with no final component ###
+Symlinks with no final component
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A pair of special-case symlinks deserve a little further explanation.
-Both result in a new `struct path` (with mount and dentry) being set
-up in the `nameidata`, and result in `get_link()` returning `NULL`.
+Both result in a new ``struct path`` (with mount and dentry) being set
+up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
-The more obvious case is a symlink to "`/`". All symlinks starting
-with "`/`" are detected in `get_link()` which resets the `nameidata`
+The more obvious case is a symlink to "``/``". All symlinks starting
+with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
to point to the effective filesystem root. If the symlink only
-contains "`/`" then there is nothing more to do, no components at all,
-so `NULL` is returned to indicate that the symlink can be released and
+contains "``/``" then there is nothing more to do, no components at all,
+so ``NULL`` is returned to indicate that the symlink can be released and
the stack frame discarded.
-The other case involves things in `/proc` that look like symlinks but
-aren't really.
+The other case involves things in ``/proc`` that look like symlinks but
+aren't really::
-> $ ls -l /proc/self/fd/1
-> lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
+ $ ls -l /proc/self/fd/1
+ lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
-Every open file descriptor in any process is represented in `/proc` by
+Every open file descriptor in any process is represented in ``/proc`` by
something that looks like a symlink. It is really a reference to the
-target file, not just the name of it. When you `readlink` these
+target file, not just the name of it. When you ``readlink`` these
objects you get a name that might refer to the same file - unless it
-has been unlinked or mounted over. When `walk_component()` follows
-one of these, the `->follow_link()` method in "procfs" doesn't return
-a string name, but instead calls `nd_jump_link()` which updates the
-`nameidata` in place to point to that target. `->follow_link()` then
-returns `NULL`. Again there is no final component and `get_link()`
-reports this by leaving the `last_type` field of `nameidata` as
-`LAST_BIND`.
+has been unlinked or mounted over. When ``walk_component()`` follows
+one of these, the ``->follow_link()`` method in "procfs" doesn't return
+a string name, but instead calls ``nd_jump_link()`` which updates the
+``nameidata`` in place to point to that target. ``->follow_link()`` then
+returns ``NULL``. Again there is no final component and ``get_link()``
+reports this by leaving the ``last_type`` field of ``nameidata`` as
+``LAST_BIND``.
Following the symlink in the final component
--------------------------------------------
-All this leads to `link_path_walk()` walking down every component, and
+All this leads to ``link_path_walk()`` walking down every component, and
following all symbolic links it finds, until it reaches the final
-component. This is just returned in the `last` field of `nameidata`.
+component. This is just returned in the ``last`` field of ``nameidata``.
For some callers, this is all they need; they want to create that
-`last` name if it doesn't exist or give an error if it does. Other
+``last`` name if it doesn't exist or give an error if it does. Other
callers will want to follow a symlink if one is found, and possibly
apply special handling to the last component of that symlink, rather
than just the last component of the original file name. These callers
-potentially need to call `link_path_walk()` again and again on
+potentially need to call ``link_path_walk()`` again and again on
successive symlinks until one is found that doesn't point to another
symlink.
-This case is handled by the relevant caller of `link_path_walk()`, such as
-`path_lookupat()` using a loop that calls `link_path_walk()`, and then
+This case is handled by the relevant caller of ``link_path_walk()``, such as
+``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
handles the final component. If the final component is a symlink
-that needs to be followed, then `trailing_symlink()` is called to set
-things up properly and the loop repeats, calling `link_path_walk()`
+that needs to be followed, then ``trailing_symlink()`` is called to set
+things up properly and the loop repeats, calling ``link_path_walk()``
again. This could loop as many as 40 times if the last component of
each symlink is another symlink.
The various functions that examine the final component and possibly
-report that it is a symlink are `lookup_last()`, `mountpoint_last()`
-and `do_last()`, each of which use the same convention as
-`walk_component()` of returning `1` if a symlink was found that needs
+report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
+and ``do_last()``, each of which use the same convention as
+``walk_component()`` of returning ``1`` if a symlink was found that needs
to be followed.
-Of these, `do_last()` is the most interesting as it is used for
-opening a file. Part of `do_last()` runs with `i_mutex` held and this
-part is in a separate function: `lookup_open()`.
+Of these, ``do_last()`` is the most interesting as it is used for
+opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this
+part is in a separate function: ``lookup_open()``.
-Explaining `do_last()` completely is beyond the scope of this article,
+Explaining ``do_last()`` completely is beyond the scope of this article,
but a few highlights should help those interested in exploring the
code.
-1. Rather than just finding the target file, `do_last()` needs to open
- it. If the file was found in the dcache, then `vfs_open()` is used for
- this. If not, then `lookup_open()` will either call `atomic_open()` (if
- the filesystem provides it) to combine the final lookup with the open, or
- will perform the separate `lookup_real()` and `vfs_create()` steps
- directly. In the later case the actual "open" of this newly found or
- created file will be performed by `vfs_open()`, just as if the name
- were found in the dcache.
-
-2. `vfs_open()` can fail with `-EOPENSTALE` if the cached information
- wasn't quite current enough. Rather than restarting the lookup from
- the top with `LOOKUP_REVAL` set, `lookup_open()` is called instead,
- giving the filesystem a chance to resolve small inconsistencies.
- If that doesn't work, only then is the lookup restarted from the top.
+1. Rather than just finding the target file, ``do_last()`` needs to open
+ it. If the file was found in the dcache, then ``vfs_open()`` is used for
+ this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
+ the filesystem provides it) to combine the final lookup with the open, or
+ will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
+ directly. In the later case the actual "open" of this newly found or
+ created file will be performed by ``vfs_open()``, just as if the name
+ were found in the dcache.
+
+2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
+ wasn't quite current enough. Rather than restarting the lookup from
+ the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
+ giving the filesystem a chance to resolve small inconsistencies.
+ If that doesn't work, only then is the lookup restarted from the top.
3. An open with O_CREAT **does** follow a symlink in the final component,
- unlike other creation system calls (like `mkdir`). So the sequence:
+ unlike other creation system calls (like ``mkdir``). So the sequence::
- > ln -s bar /tmp/foo
- > echo hello > /tmp/foo
+ ln -s bar /tmp/foo
+ echo hello > /tmp/foo
- will create a file called `/tmp/bar`. This is not permitted if
- `O_EXCL` is set but otherwise is handled for an O_CREAT open much
- like for a non-creating open: `should_follow_link()` returns `1`, and
- so does `do_last()` so that `trailing_symlink()` gets called and the
- open process continues on the symlink that was found.
+ will create a file called ``/tmp/bar``. This is not permitted if
+ ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
+ like for a non-creating open: ``should_follow_link()`` returns ``1``, and
+ so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
+ open process continues on the symlink that was found.
Updating the access time
------------------------
@@ -1180,110 +1227,112 @@ footprints are best kept to a minimum.
One other place where walking down a symlink can involve leaving
footprints in a way that doesn't affect directories is in updating access times.
In Unix (and Linux) every filesystem object has a "last accessed
-time", or "`atime`". Passing through a directory to access a file
+time", or "``atime``". Passing through a directory to access a file
within is not considered to be an access for the purposes of
-`atime`; only listing the contents of a directory can update its `atime`.
-Symlinks are different it seems. Both reading a symlink (with `readlink()`)
+``atime``; only listing the contents of a directory can update its ``atime``.
+Symlinks are different it seems. Both reading a symlink (with ``readlink()``)
and looking up a symlink on the way to some other destination can
update the atime on that symlink.
-[clearest statement]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
+.. _clearest statement: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
It is not clear why this is the case; POSIX has little to say on the
-subject. The [clearest statement] is that, if a particular implementation
+subject. The `clearest statement`_ is that, if a particular implementation
updates a timestamp in a place not specified by POSIX, this must be
documented "except that any changes caused by pathname resolution need
not be documented". This seems to imply that POSIX doesn't really
care about access-time updates during pathname lookup.
-[Linux 1.3.87]: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/fs/ext2/symlink.c?id=f806c6db77b8eaa6e00dcfb6b567706feae8dbb8
+.. _Linux 1.3.87: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/fs/ext2/symlink.c?id=f806c6db77b8eaa6e00dcfb6b567706feae8dbb8
-An examination of history shows that prior to [Linux 1.3.87], the ext2
+An examination of history shows that prior to `Linux 1.3.87`_, the ext2
filesystem, at least, didn't update atime when following a link.
Unfortunately we have no record of why that behavior was changed.
In any case, access time must now be updated and that operation can be
quite complex. Trying to stay in RCU-walk while doing it is best
-avoided. Fortunately it is often permitted to skip the `atime`
-update. Because `atime` updates cause performance problems in various
-areas, Linux supports the `relatime` mount option, which generally
-limits the updates of `atime` to once per day on files that aren't
+avoided. Fortunately it is often permitted to skip the ``atime``
+update. Because ``atime`` updates cause performance problems in various
+areas, Linux supports the ``relatime`` mount option, which generally
+limits the updates of ``atime`` to once per day on files that aren't
being changed (and symlinks never change once created). Even without
-`relatime`, many filesystems record `atime` with a one-second
+``relatime``, many filesystems record ``atime`` with a one-second
granularity, so only one update per second is required.
-It is easy to test if an `atime` update is needed while in RCU-walk
+It is easy to test if an ``atime`` update is needed while in RCU-walk
mode and, if it isn't, the update can be skipped and RCU-walk mode
-continues. Only when an `atime` update is actually required does the
+continues. Only when an ``atime`` update is actually required does the
path walk drop down to REF-walk. All of this is handled in the
-`get_link()` function.
+``get_link()`` function.
A few flags
-----------
A suitable way to wrap up this tour of pathname walking is to list
-the various flags that can be stored in the `nameidata` to guide the
+the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
component, others reflect the current state of the pathname lookup.
-And then there is `LOOKUP_EMPTY`, which doesn't fit conceptually with
+And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
an error.
-### Global state flags ###
+Global state flags
+~~~~~~~~~~~~~~~~~~
-We have already met two global state flags: `LOOKUP_RCU` and
-`LOOKUP_REVAL`. These select between one of three overall approaches
+We have already met two global state flags: ``LOOKUP_RCU`` and
+``LOOKUP_REVAL``. These select between one of three overall approaches
to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
-`LOOKUP_PARENT` indicates that the final component hasn't been reached
+``LOOKUP_PARENT`` indicates that the final component hasn't been reached
yet. This is primarily used to tell the audit subsystem the full
context of a particular access being audited.
-`LOOKUP_ROOT` indicates that the `root` field in the `nameidata` was
+``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was
provided by the caller, so it shouldn't be released when it is no
longer needed.
-`LOOKUP_JUMPED` means that the current dentry was chosen not because
+``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
-following "`..`", following a symlink to `/`, crossing a mount point
-or accessing a "`/proc/$PID/fd/$FD`" symlink. In this case the
+following "``..``", following a symlink to ``/``, crossing a mount point
+or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
filesystem has not been asked to revalidate the name (with
-`d_revalidate()`). In such cases the inode may still need to be
-revalidated, so `d_op->d_weak_revalidate()` is called if
-`LOOKUP_JUMPED` is set when the look completes - which may be at the
+``d_revalidate()``). In such cases the inode may still need to be
+revalidated, so ``d_op->d_weak_revalidate()`` is called if
+``LOOKUP_JUMPED`` is set when the look completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
-### Final-component flags ###
+Final-component flags
+~~~~~~~~~~~~~~~~~~~~~
Some of these flags are only set when the final component is being
considered. Others are only checked for when considering that final
component.
-`LOOKUP_AUTOMOUNT` ensures that, if the final component is an automount
+``LOOKUP_AUTOMOUNT`` ensures that, if the final component is an automount
point, then the mount is triggered. Some operations would trigger it
-anyway, but operations like `stat()` deliberately don't. `statfs()`
-needs to trigger the mount but otherwise behaves a lot like `stat()`, so
-it sets `LOOKUP_AUTOMOUNT`, as does "`quotactl()`" and the handling of
-"`mount --bind`".
+anyway, but operations like ``stat()`` deliberately don't. ``statfs()``
+needs to trigger the mount but otherwise behaves a lot like ``stat()``, so
+it sets ``LOOKUP_AUTOMOUNT``, as does "``quotactl()``" and the handling of
+"``mount --bind``".
-`LOOKUP_FOLLOW` has a similar function to `LOOKUP_AUTOMOUNT` but for
+``LOOKUP_FOLLOW`` has a similar function to ``LOOKUP_AUTOMOUNT`` but for
symlinks. Some system calls set or clear it implicitly, while
-others have API flags such as `AT_SYMLINK_FOLLOW` and
-`UMOUNT_NOFOLLOW` to control it. Its effect is similar to
-`WALK_GET` that we already met, but it is used in a different way.
+others have API flags such as ``AT_SYMLINK_FOLLOW`` and
+``UMOUNT_NOFOLLOW`` to control it. Its effect is similar to
+``WALK_GET`` that we already met, but it is used in a different way.
-`LOOKUP_DIRECTORY` insists that the final component is a directory.
+``LOOKUP_DIRECTORY`` insists that the final component is a directory.
Various callers set this and it is also set when the final component
is found to be followed by a slash.
-Finally `LOOKUP_OPEN`, `LOOKUP_CREATE`, `LOOKUP_EXCL`, and
-`LOOKUP_RENAME_TARGET` are not used directly by the VFS but are made
-available to the filesystem and particularly the `->d_revalidate()`
+Finally ``LOOKUP_OPEN``, ``LOOKUP_CREATE``, ``LOOKUP_EXCL``, and
+``LOOKUP_RENAME_TARGET`` are not used directly by the VFS but are made
+available to the filesystem and particularly the ``->d_revalidate()``
method. A filesystem can choose not to bother revalidating too hard
if it knows that it will be asked to open or create the file soon.
-These flags were previously useful for `->lookup()` too but with the
-introduction of `->atomic_open()` they are less relevant there.
+These flags were previously useful for ``->lookup()`` too but with the
+introduction of ``->atomic_open()`` they are less relevant there.
End of the road
---------------
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
deleted file mode 100644
index 106d17fbb05f..000000000000
--- a/Documentation/filesystems/pohmelfs/design_notes.txt
+++ /dev/null
@@ -1,72 +0,0 @@
-POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
-
- Evgeniy Polyakov <zbr@ioremap.net>
-
-Homepage: http://www.ioremap.net/projects/pohmelfs
-
-POHMELFS first began as a network filesystem with coherent local data and
-metadata caches but is now evolving into a parallel distributed filesystem.
-
-Main features of this FS include:
- * Locally coherent cache for data and metadata with (potentially) byte-range locks.
- Since all Linux filesystems lock the whole inode during writing, algorithm
- is very simple and does not use byte-ranges, although they are sent in
- locking messages.
- * Completely async processing of all events except creation of hard and symbolic
- links, and rename events.
- Object creation and data reading and writing are processed asynchronously.
- * Flexible object architecture optimized for network processing.
- Ability to create long paths to objects and remove arbitrarily huge
- directories with a single network command.
- (like removing the whole kernel tree via a single network command).
- * Very high performance.
- * Fast and scalable multithreaded userspace server. Being in userspace it works
- with any underlying filesystem and still is much faster than async in-kernel NFS one.
- * Client is able to switch between different servers (if one goes down, client
- automatically reconnects to second and so on).
- * Transactions support. Full failover for all operations.
- Resending transactions to different servers on timeout or error.
- * Read request (data read, directory listing, lookup requests) balancing between multiple servers.
- * Write requests are replicated to multiple servers and completed only when all of them are acked.
- * Ability to add and/or remove servers from the working set at run-time.
- * Strong authentication and possible data encryption in network channel.
- * Extended attributes support.
-
-POHMELFS is based on transactions, which are potentially long-standing objects that live
-in the client's memory. Each transaction contains all the information needed to process a given
-command (or set of commands, which is frequently used during data writing: single transactions
-can contain creation and data writing commands). Transactions are committed by all the servers
-to which they are sent and, in case of failures, are eventually resent or dropped with an error.
-For example, reading will return an error if no servers are available.
-
-POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is
-possible to detach replies from requests and, if the command requires data to be received, the
-caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
-servers and async threads will pick up replies in parallel, find appropriate transactions in the
-system and put the data where it belongs (like the page or inode cache).
-
-The main feature of POHMELFS is writeback data and the metadata cache.
-Only a few non-performance critical operations use the write-through cache and
-are synchronous: hard and symbolic link creation, and object rename. Creation,
-removal of objects and data writing are asynchronous and are sent to
-the server during system writeback. Only one writer at a time is allowed for any
-given inode, which is guarded by an appropriate locking protocol.
-Because of this feature, POHMELFS is extremely fast at metadata intensive
-workloads and can fully utilize the bandwidth to the servers when doing bulk
-data transfers.
-
-POHMELFS clients operate with a working set of servers and are capable of balancing read-only
-operations (like lookups or directory listings) between them according to IO priorities.
-Administrators can add or remove servers from the set at run-time via special commands (described
-in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers, which
-are connected with write permission turned on. IO priority and permissions can be changed in
-run-time.
-
-POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
-One can select any kernel supported cipher, encryption mode, hash type and operation mode
-(hmac or digest). It is also possible to use both or neither (default). Crypto configuration
-is checked during mount time and, if the server does not support it, appropriate capabilities
-will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified).
-Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
-crypto operations and send the resulting data to server or submit it up the stack. This number
-can be controlled via a mount option.
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
deleted file mode 100644
index db2e41393626..000000000000
--- a/Documentation/filesystems/pohmelfs/info.txt
+++ /dev/null
@@ -1,99 +0,0 @@
-POHMELFS usage information.
-
-Mount options.
-All but index, number of crypto threads and maximum IO size can changed via remount.
-
-idx=%u
- Each mountpoint is associated with a special index via this option.
- Administrator can add or remove servers from the given index, so all mounts,
- which were attached to it, are updated.
- Default it is 0.
-
-trans_scan_timeout=%u
- This timeout, expressed in milliseconds, specifies time to scan transaction
- trees looking for stale requests, which have to be resent, or if number of
- retries exceed specified limit, dropped with error.
- Default is 5 seconds.
-
-drop_scan_timeout=%u
- Internal timeout, expressed in milliseconds, which specifies how frequently
- inodes marked to be dropped are freed. It also specifies how frequently
- the system checks that servers have to be added or removed from current working set.
- Default is 1 second.
-
-wait_on_page_timeout=%u
- Number of milliseconds to wait for reply from remote server for data reading command.
- If this timeout is exceeded, reading returns an error.
- Default is 5 seconds.
-
-trans_retries=%u
- This is the number of times that a transaction will be resent to a server that did
- not answer for the last @trans_scan_timeout milliseconds.
- When the number of resends exceeds this limit, the transaction is completed with error.
- Default is 5 resends.
-
-crypto_thread_num=%u
- Number of crypto processing threads. Threads are used both for RX and TX traffic.
- Default is 2, or no threads if crypto operations are not supported.
-
-trans_max_pages=%u
- Maximum number of pages in a single transaction. This parameter also controls
- the number of pages, allocated for crypto processing (each crypto thread has
- pool of pages, the number of which is equal to 'trans_max_pages'.
- Default is 100 pages.
-
-crypto_fail_unsupported
- If specified, mount will fail if the server does not support requested crypto operations.
- By default mount will disable non-matching crypto operations.
-
-mcache_timeout=%u
- Maximum number of milliseconds to wait for the mcache objects to be processed.
- Mcache includes locks (given lock should be granted by server), attributes (they should be
- fully received in the given timeframe).
- Default is 5 seconds.
-
-Usage examples.
-
-Add server server1.net:1025 into the working set with index $idx
-with appropriate hash algorithm and key file and cipher algorithm, mode and key file:
-$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
-
-Mount filesystem with given index $idx to /mnt mountpoint.
-Client will connect to all servers specified in the working set via previous command:
-mount -t pohmel -o idx=$idx q /mnt
-
-Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw):
-$cfg A modify -a server1.net -p 1025 -i $idx -I 1
-
-Change IO priority to 123 (node with the highest priority gets read requests).
-$cfg A modify -a server1.net -p 1025 -i $idx -P 123
-
-One can check currect status of all connections in the mountstats file:
-# cat /proc/$PID/mountstats
-...
-device none mounted on /mnt with fstype pohmel
-idx addr(:port) socket_type protocol active priority permissions
-0 server1.net:1026 1 6 1 250 1
-0 server2.net:1025 1 6 1 123 3
-
-Server installation.
-
-Creating a server, which listens at port 1025 and 0.0.0.0 address.
-Working root directory (note, that server chroots there, so you have to have appropriate permissions)
-is set to /mnt, server will negotiate hash/cipher with client, in case client requested it, there
-are appropriate key files.
-Number of working threads is set to 10.
-
-# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key
-
- -A 6 - listen on ipv6 address. Default: Disabled.
- -r root - path to root directory. Default: /tmp.
- -a addr - listen address. Default: 0.0.0.0.
- -p port - listen port. Default: 1025.
- -w workers - number of workers per connected client. Default: 1.
- -K file - hash key size. Default: none.
- -k file - cipher key size. Default: none.
- -h - this help.
-
-Number of worker threads specifies how many workers will be created for each client.
-Bulk single-client transafers usually are better handled with smaller number (like 1-3).
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
deleted file mode 100644
index c680b4b5353d..000000000000
--- a/Documentation/filesystems/pohmelfs/network_protocol.txt
+++ /dev/null
@@ -1,227 +0,0 @@
-POHMELFS network protocol.
-
-Basic structure used in network communication is following command:
-
-struct netfs_cmd
-{
- __u16 cmd; /* Command number */
- __u16 csize; /* Attached crypto information size */
- __u16 cpad; /* Attached padding size */
- __u16 ext; /* External flags */
- __u32 size; /* Size of the attached data */
- __u32 trans; /* Transaction id */
- __u64 id; /* Object ID to operate on. Used for feedback.*/
- __u64 start; /* Start of the object. */
- __u64 iv; /* IV sequence */
- __u8 data[0];
-};
-
-Commands can be embedded into transaction command (which in turn has own command),
-so one can extend protocol as needed without breaking backward compatibility as long
-as old commands are supported. All string lengths include tail 0 byte.
-
-All commands are transferred over the network in big-endian. CPU endianness is used at the end peers.
-
-@cmd - command number, which specifies command to be processed. Following
- commands are used currently:
-
- NETFS_READDIR = 1, /* Read directory for given inode number */
- NETFS_READ_PAGE, /* Read data page from the server */
- NETFS_WRITE_PAGE, /* Write data page to the server */
- NETFS_CREATE, /* Create directory entry */
- NETFS_REMOVE, /* Remove directory entry */
- NETFS_LOOKUP, /* Lookup single object */
- NETFS_LINK, /* Create a link */
- NETFS_TRANS, /* Transaction */
- NETFS_OPEN, /* Open intent */
- NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */
- NETFS_PAGE_CACHE, /* Page cache invalidation message */
- NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */
- NETFS_RENAME, /* Rename object */
- NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */
- NETFS_LOCK, /* Distributed lock message */
- NETFS_XATTR_SET, /* Set extended attribute */
- NETFS_XATTR_GET, /* Get extended attribute */
-
-@ext - external flags. Used by different commands to specify some extra arguments
- like partial size of the embedded objects or creation flags.
-
-@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
- but size of the requested data is incorporated here. It does not include size of the command
- header (struct netfs_cmd) itself.
-
-@id - id of the object this command operates on. Each command can use it for own purpose.
-
-@start - start of the object this command operates on. Each command can use it for own purpose.
-
-@csize, @cpad - size and padding size of the (attached if needed) crypto information.
-
-Command specifications.
-
-@NETFS_READDIR
-This command is used to sync content of the remote dir to the client.
-
-@ext - length of the path to object.
-@size - the same.
-@id - local inode number of the directory to read.
-@start - zero.
-
-
-@NETFS_READ_PAGE
-This command is used to read data from remote server.
-Data size does not exceed local page cache size.
-
-@id - inode number.
-@start - first byte offset.
-@size - number of bytes to read plus length of the path to object.
-@ext - object path length.
-
-
-@NETFS_CREATE
-Used to create object.
-It does not require that all directories on top of the object were
-already created, it will create them automatically. Each object has
-associated @netfs_path_entry data structure, which contains creation
-mode (permissions and type) and length of the name as long as name itself.
-
-@start - 0
-@size - size of the all data structures needed to create a path
-@id - local inode number
-@ext - 0
-
-
-@NETFS_REMOVE
-Used to remove object.
-
-@ext - length of the path to object.
-@size - the same.
-@id - local inode number.
-@start - zero.
-
-
-@NETFS_LOOKUP
-Lookup information about object on server.
-
-@ext - length of the path to object.
-@size - the same.
-@id - local inode number of the directory to look object in.
-@start - local inode number of the object to look at.
-
-
-@NETFS_LINK
-Create hard of symlink.
-Command is sent as "object_path|target_path".
-
-@size - size of the above string.
-@id - parent local inode number.
-@start - 1 for symlink, 0 for hardlink.
-@ext - size of the "object_path" above.
-
-
-@NETFS_TRANS
-Transaction header.
-
-@size - incorporates all embedded command sizes including theirs header sizes.
-@start - transaction generation number - unique id used to find transaction.
-@ext - transaction flags. Unused at the moment.
-@id - 0.
-
-
-@NETFS_OPEN
-Open intent for given transaction.
-
-@id - local inode number.
-@start - 0.
-@size - path length to the object.
-@ext - open flags (O_RDWR and so on).
-
-
-@NETFS_INODE_INFO
-Metadata update command.
-It is sent to servers when attributes of the object are changed and received
-when data or metadata were updated. It operates with the following structure:
-
-struct netfs_inode_info
-{
- unsigned int mode;
- unsigned int nlink;
- unsigned int uid;
- unsigned int gid;
- unsigned int blocksize;
- unsigned int padding;
- __u64 ino;
- __u64 blocks;
- __u64 rdev;
- __u64 size;
- __u64 version;
-};
-
-It effectively mirrors stat(2) returned data.
-
-
-@ext - path length to the object.
-@size - the same plus size of the netfs_inode_info structure.
-@id - local inode number.
-@start - 0.
-
-
-@NETFS_PAGE_CACHE
-Command is only received by clients. It contains information about
-page to be marked as not up-to-date.
-
-@id - client's inode number.
-@start - last byte of the page to be invalidated. If it is not equal to
- current inode size, it will be vmtruncated().
-@size - 0
-@ext - 0
-
-
-@NETFS_READ_PAGES
-Used to read multiple contiguous pages in one go.
-
-@start - first byte of the contiguous region to read.
-@size - contains of two fields: lower 8 bits are used to represent page cache shift
- used by client, another 3 bytes are used to get number of pages.
-@id - local inode number.
-@ext - path length to the object.
-
-
-@NETFS_RENAME
-Used to rename object.
-Attached data is formed into following string: "old_path|new_path".
-
-@id - local inode number.
-@start - parent inode number.
-@size - length of the above string.
-@ext - length of the old path part.
-
-
-@NETFS_CAPABILITIES
-Used to exchange crypto capabilities with server.
-If crypto capabilities are not supported by server, then client will disable it
-or fail (if 'crypto_fail_unsupported' mount options was specified).
-
-@id - superblock index. Used to specify crypto information for group of servers.
-@size - size of the attached capabilities structure.
-@start - 0.
-@size - 0.
-@scsize - 0.
-
-@NETFS_LOCK
-Used to send lock request/release messages. Although it sends byte range request
-and is capable of flushing pages based on that, it is not used, since all Linux
-filesystems lock the whole inode.
-
-@id - lock generation number.
-@start - start of the locked range.
-@size - size of the locked range.
-@ext - lock type: read/write. Not used actually. 15'th bit is used to determine,
- if it is lock request (1) or release (0).
-
-@NETFS_XATTR_SET
-@NETFS_XATTR_GET
-Used to set/get extended attributes for given inode.
-@id - attribute generation number or xattr setting type
-@start - size of the attribute (request or attached)
-@size - name length, path len and data size for given attribute
-@ext - path length for given object
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 7b7b845c490a..cf43bc4dbf31 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -622,3 +622,19 @@ in your dentry operations instead.
alloc_file_clone(file, flags, ops) does not affect any caller's references.
On success you get a new struct file sharing the mount/dentry with the
original, on failure - ERR_PTR().
+--
+[mandatory]
+ ->clone_file_range() and ->dedupe_file_range have been replaced with
+ ->remap_file_range(). See Documentation/filesystems/vfs.txt for more
+ information.
+--
+[recommended]
+ ->lookup() instances doing an equivalent of
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+ return d_splice_alias(inode, dentry);
+ don't need to bother with the check - d_splice_alias() will do the
+ right thing when given ERR_PTR(...) as inode. Moreover, passing NULL
+ inode to d_splice_alias() will also do the right thing (equivalent of
+ d_add(dentry, NULL); return NULL;), so that kind of special cases
+ also doesn't need a separate treatment.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22b4b00dee31..66cad5c86171 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -125,6 +125,13 @@ process running on the system, which is named after the process ID (PID).
The link self points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
+Note that an open a file descriptor to /proc/<pid> or to any of its
+contained files or subdirectories does not prevent <pid> being reused
+for some other process in the event that <pid> exits. Operations on
+open /proc/<pid> file descriptors corresponding to dead processes
+never act on any new process that the kernel may, through chance, have
+also assigned the process ID <pid>. Instead, operations on these FDs
+usually fail with ESRCH.
Table 1-1: Process specific entries in /proc
..............................................................................
@@ -182,6 +189,7 @@ read the file /proc/PID/status:
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
+ THP_enabled: 1
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -193,8 +201,10 @@ read the file /proc/PID/status:
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
+ CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
+ Speculation_Store_Bypass: thread vulnerable
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 1
@@ -214,7 +224,7 @@ asynchronous manner and the value may not be very precise. To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
It's slow but very precise.
-Table 1-2: Contents of the status files (as of 4.8)
+Table 1-2: Contents of the status files (as of 4.19)
..............................................................................
Field Content
Name filename of the executable
@@ -256,6 +266,8 @@ Table 1-2: Contents of the status files (as of 4.8)
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
@@ -267,8 +279,10 @@ Table 1-2: Contents of the status files (as of 4.8)
CapPrm bitmap of permitted capabilities
CapEff bitmap of effective capabilities
CapBnd bitmap of capabilities bounding set
+ CapAmb bitmap of ambient capabilities
NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
+ Speculation_Store_Bypass speculative store bypass mitigation status
Cpus_allowed mask of CPUs on which this process may run
Cpus_allowed_list Same as previous, but in "list format"
Mems_allowed mask of memory nodes allowed to this process
@@ -425,6 +439,7 @@ SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
+THPeligible: 0
VmFlags: rd ex mr mw me dw
the first of these lines shows the same information as is displayed for the
@@ -462,6 +477,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
+"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
+true, 0 otherwise.
"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
@@ -496,7 +513,9 @@ manner. The codes are the following:
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
-be vanished or the reverse -- new added.
+be vanished or the reverse -- new added. Interpretation of their meaning
+might change in future as well. So each consumer of these flags has to
+follow each specific kernel version for the exact semantic.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
@@ -858,6 +877,7 @@ Writeback: 0 kB
AnonPages: 861800 kB
Mapped: 280372 kB
Shmem: 644 kB
+KReclaimable: 168048 kB
Slab: 284364 kB
SReclaimable: 159856 kB
SUnreclaim: 124508 kB
@@ -925,6 +945,9 @@ AnonHugePages: Non-file backed huge pages mapped into userspace page tables
ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
with huge pages
ShmemPmdMapped: Shared memory mapped into userspace with huge pages
+KReclaimable: Kernel allocations that the kernel will attempt to reclaim
+ under memory pressure. Includes SReclaimable (below), and other
+ direct allocations with a shrinker.
Slab: in-kernel data structures cache
SReclaimable: Part of Slab, that might be reclaimed, such as caches
SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt
index 4f3d6a882bdc..48ea68f15845 100644
--- a/Documentation/filesystems/qnx6.txt
+++ b/Documentation/filesystems/qnx6.txt
@@ -87,7 +87,7 @@ addressed with 16 direct blocks.
For more than 16 blocks an indirect addressing in form of another tree is
used. (scheme is the same as the one used for the superblock root nodes)
-The filesize is stored 64bit. Inode counting starts with 1. (whilst long
+The filesize is stored 64bit. Inode counting starts with 1. (while long
filename inodes start with 0)
Directories
@@ -155,7 +155,7 @@ Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
Each superblock got it's own half of the system area. So superblock #1
-always uses blocks from the lower half whilst superblock #2 just writes to
+always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
Bitmap blocks, Inode blocks and indirect addressing blocks for those two
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt
index 1343d118a9b2..eb9e3aa63026 100644
--- a/Documentation/filesystems/spufs.txt
+++ b/Documentation/filesystems/spufs.txt
@@ -452,7 +452,7 @@ RETURN VALUE
ERRORS
- EACCESS
+ EACCES
The current user does not have write access on the spufs mount
point.
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index a1426cabcef1..41411b0c60a3 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -344,7 +344,9 @@ struct bus_attribute {
Declaring:
-BUS_ATTR(_name, _mode, _show, _store)
+static BUS_ATTR_RW(name);
+static BUS_ATTR_RO(name);
+static BUS_ATTR_WO(name);
Creation/Removal:
diff --git a/Documentation/filesystems/ubifs-authentication.md b/Documentation/filesystems/ubifs-authentication.md
new file mode 100644
index 000000000000..028b3e2e25f9
--- /dev/null
+++ b/Documentation/filesystems/ubifs-authentication.md
@@ -0,0 +1,426 @@
+% UBIFS Authentication
+% sigma star gmbh
+% 2018
+
+# Introduction
+
+UBIFS utilizes the fscrypt framework to provide confidentiality for file
+contents and file names. This prevents attacks where an attacker is able to
+read contents of the filesystem on a single point in time. A classic example
+is a lost smartphone where the attacker is unable to read personal data stored
+on the device without the filesystem decryption key.
+
+At the current state, UBIFS encryption however does not prevent attacks where
+the attacker is able to modify the filesystem contents and the user uses the
+device afterwards. In such a scenario an attacker can modify filesystem
+contents arbitrarily without the user noticing. One example is to modify a
+binary to perform a malicious action when executed [DMC-CBC-ATTACK]. Since
+most of the filesystem metadata of UBIFS is stored in plain, this makes it
+fairly easy to swap files and replace their contents.
+
+Other full disk encryption systems like dm-crypt cover all filesystem metadata,
+which makes such kinds of attacks more complicated, but not impossible.
+Especially, if the attacker is given access to the device multiple points in
+time. For dm-crypt and other filesystems that build upon the Linux block IO
+layer, the dm-integrity or dm-verity subsystems [DM-INTEGRITY, DM-VERITY]
+can be used to get full data authentication at the block layer.
+These can also be combined with dm-crypt [CRYPTSETUP2].
+
+This document describes an approach to get file contents _and_ full metadata
+authentication for UBIFS. Since UBIFS uses fscrypt for file contents and file
+name encryption, the authentication system could be tied into fscrypt such that
+existing features like key derivation can be utilized. It should however also
+be possible to use UBIFS authentication without using encryption.
+
+
+## MTD, UBI & UBIFS
+
+On Linux, the MTD (Memory Technology Devices) subsystem provides a uniform
+interface to access raw flash devices. One of the more prominent subsystems that
+work on top of MTD is UBI (Unsorted Block Images). It provides volume management
+for flash devices and is thus somewhat similar to LVM for block devices. In
+addition, it deals with flash-specific wear-leveling and transparent I/O error
+handling. UBI offers logical erase blocks (LEBs) to the layers on top of it
+and maps them transparently to physical erase blocks (PEBs) on the flash.
+
+UBIFS is a filesystem for raw flash which operates on top of UBI. Thus, wear
+leveling and some flash specifics are left to UBI, while UBIFS focuses on
+scalability, performance and recoverability.
+
+
+
+ +------------+ +*******+ +-----------+ +-----+
+ | | * UBIFS * | UBI-BLOCK | | ... |
+ | JFFS/JFFS2 | +*******+ +-----------+ +-----+
+ | | +-----------------------------+ +-----------+ +-----+
+ | | | UBI | | MTD-BLOCK | | ... |
+ +------------+ +-----------------------------+ +-----------+ +-----+
+ +------------------------------------------------------------------+
+ | MEMORY TECHNOLOGY DEVICES (MTD) |
+ +------------------------------------------------------------------+
+ +-----------------------------+ +--------------------------+ +-----+
+ | NAND DRIVERS | | NOR DRIVERS | | ... |
+ +-----------------------------+ +--------------------------+ +-----+
+
+ Figure 1: Linux kernel subsystems for dealing with raw flash
+
+
+
+Internally, UBIFS maintains multiple data structures which are persisted on
+the flash:
+
+- *Index*: an on-flash B+ tree where the leaf nodes contain filesystem data
+- *Journal*: an additional data structure to collect FS changes before updating
+ the on-flash index and reduce flash wear.
+- *Tree Node Cache (TNC)*: an in-memory B+ tree that reflects the current FS
+ state to avoid frequent flash reads. It is basically the in-memory
+ representation of the index, but contains additional attributes.
+- *LEB property tree (LPT)*: an on-flash B+ tree for free space accounting per
+ UBI LEB.
+
+In the remainder of this section we will cover the on-flash UBIFS data
+structures in more detail. The TNC is of less importance here since it is never
+persisted onto the flash directly. More details on UBIFS can also be found in
+[UBIFS-WP].
+
+
+### UBIFS Index & Tree Node Cache
+
+Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types
+of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file
+contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes.
+Almost all types of nodes share a common header (`ubifs_ch`) containing basic
+information like node type, node length, a sequence number, etc. (see
+`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT
+and some less important node types like padding nodes which are used to pad
+unusable content at the end of LEBs.
+
+To avoid re-writing the whole B+ tree on every single change, it is implemented
+as *wandering tree*, where only the changed nodes are re-written and previous
+versions of them are obsoleted without erasing them right away. As a result,
+the index is not stored in a single place on the flash, but *wanders* around
+and there are obsolete parts on the flash as long as the LEB containing them is
+not reused by UBIFS. To find the most recent version of the index, UBIFS stores
+a special node called *master node* into UBI LEB 1 which always points to the
+most recent root node of the UBIFS index. For recoverability, the master node
+is additionally duplicated to LEB 2. Mounting UBIFS is thus a simple read of
+LEB 1 and 2 to get the current master node and from there get the location of
+the most recent on-flash index.
+
+The TNC is the in-memory representation of the on-flash index. It contains some
+additional runtime attributes per node which are not persisted. One of these is
+a dirty-flag which marks nodes that have to be persisted the next time the
+index is written onto the flash. The TNC acts as a write-back cache and all
+modifications of the on-flash index are done through the TNC. Like other caches,
+the TNC does not have to mirror the full index into memory, but reads parts of
+it from flash whenever needed. A *commit* is the UBIFS operation of updating the
+on-flash filesystem structures like the index. On every commit, the TNC nodes
+marked as dirty are written to the flash to update the persisted index.
+
+
+### Journal
+
+To avoid wearing out the flash, the index is only persisted (*commited*) when
+certain conditions are met (eg. `fsync(2)`). The journal is used to record
+any changes (in form of inode nodes, data nodes etc.) between commits
+of the index. During mount, the journal is read from the flash and replayed
+onto the TNC (which will be created on-demand from the on-flash index).
+
+UBIFS reserves a bunch of LEBs just for the journal called *log area*. The
+amount of log area LEBs is configured on filesystem creation (using
+`mkfs.ubifs`) and stored in the superblock node. The log area contains only
+two types of nodes: *reference nodes* and *commit start nodes*. A commit start
+node is written whenever an index commit is performed. Reference nodes are
+written on every journal update. Each reference node points to the position of
+other nodes (inode nodes, data nodes etc.) on the flash that are part of this
+journal entry. These nodes are called *buds* and describe the actual filesystem
+changes including their data.
+
+The log area is maintained as a ring. Whenever the journal is almost full,
+a commit is initiated. This also writes a commit start node so that during
+mount, UBIFS will seek for the most recent commit start node and just replay
+every reference node after that. Every reference node before the commit start
+node will be ignored as they are already part of the on-flash index.
+
+When writing a journal entry, UBIFS first ensures that enough space is
+available to write the reference node and buds part of this entry. Then, the
+reference node is written and afterwards the buds describing the file changes.
+On replay, UBIFS will record every reference node and inspect the location of
+the referenced LEBs to discover the buds. If these are corrupt or missing,
+UBIFS will attempt to recover them by re-reading the LEB. This is however only
+done for the last referenced LEB of the journal. Only this can become corrupt
+because of a power cut. If the recovery fails, UBIFS will not mount. An error
+for every other LEB will directly cause UBIFS to fail the mount operation.
+
+
+ | ---- LOG AREA ---- | ---------- MAIN AREA ------------ |
+
+ -----+------+-----+--------+---- ------+-----+-----+---------------
+ \ | | | | / / | | | \
+ / CS | REF | REF | | \ \ DENT | INO | INO | /
+ \ | | | | / / | | | \
+ ----+------+-----+--------+--- -------+-----+-----+----------------
+ | | ^ ^
+ | | | |
+ +------------------------+ |
+ | |
+ +-------------------------------+
+
+
+ Figure 2: UBIFS flash layout of log area with commit start nodes
+ (CS) and reference nodes (REF) pointing to main area
+ containing their buds
+
+
+### LEB Property Tree/Table
+
+The LEB property tree is used to store per-LEB information. This includes the
+LEB type and amount of free and *dirty* (old, obsolete content) space [1] on
+the LEB. The type is important, because UBIFS never mixes index nodes with data
+nodes on a single LEB and thus each LEB has a specific purpose. This again is
+useful for free space calculations. See [UBIFS-WP] for more details.
+
+The LEB property tree again is a B+ tree, but it is much smaller than the
+index. Due to its smaller size it is always written as one chunk on every
+commit. Thus, saving the LPT is an atomic operation.
+
+
+[1] Since LEBs can only be appended and never overwritten, there is a
+difference between free space ie. the remaining space left on the LEB to be
+written to without erasing it and previously written content that is obsolete
+but can't be overwritten without erasing the full LEB.
+
+
+# UBIFS Authentication
+
+This chapter introduces UBIFS authentication which enables UBIFS to verify
+the authenticity and integrity of metadata and file contents stored on flash.
+
+
+## Threat Model
+
+UBIFS authentication enables detection of offline data modification. While it
+does not prevent it, it enables (trusted) code to check the integrity and
+authenticity of on-flash file contents and filesystem metadata. This covers
+attacks where file contents are swapped.
+
+UBIFS authentication will not protect against rollback of full flash contents.
+Ie. an attacker can still dump the flash and restore it at a later time without
+detection. It will also not protect against partial rollback of individual
+index commits. That means that an attacker is able to partially undo changes.
+This is possible because UBIFS does not immediately overwrites obsolete
+versions of the index tree or the journal, but instead marks them as obsolete
+and garbage collection erases them at a later time. An attacker can use this by
+erasing parts of the current tree and restoring old versions that are still on
+the flash and have not yet been erased. This is possible, because every commit
+will always write a new version of the index root node and the master node
+without overwriting the previous version. This is further helped by the
+wear-leveling operations of UBI which copies contents from one physical
+eraseblock to another and does not atomically erase the first eraseblock.
+
+UBIFS authentication does not cover attacks where an attacker is able to
+execute code on the device after the authentication key was provided.
+Additional measures like secure boot and trusted boot have to be taken to
+ensure that only trusted code is executed on a device.
+
+
+## Authentication
+
+To be able to fully trust data read from flash, all UBIFS data structures
+stored on flash are authenticated. That is:
+
+- The index which includes file contents, file metadata like extended
+ attributes, file length etc.
+- The journal which also contains file contents and metadata by recording changes
+ to the filesystem
+- The LPT which stores UBI LEB metadata which UBIFS uses for free space accounting
+
+
+### Index Authentication
+
+Through UBIFS' concept of a wandering tree, it already takes care of only
+updating and persisting changed parts from leaf node up to the root node
+of the full B+ tree. This enables us to augment the index nodes of the tree
+with a hash over each node's child nodes. As a result, the index basically also
+a Merkle tree. Since the leaf nodes of the index contain the actual filesystem
+data, the hashes of their parent index nodes thus cover all the file contents
+and file metadata. When a file changes, the UBIFS index is updated accordingly
+from the leaf nodes up to the root node including the master node. This process
+can be hooked to recompute the hash only for each changed node at the same time.
+Whenever a file is read, UBIFS can verify the hashes from each leaf node up to
+the root node to ensure the node's integrity.
+
+To ensure the authenticity of the whole index, the UBIFS master node stores a
+keyed hash (HMAC) over its own contents and a hash of the root node of the index
+tree. As mentioned above, the master node is always written to the flash whenever
+the index is persisted (ie. on index commit).
+
+Using this approach only UBIFS index nodes and the master node are changed to
+include a hash. All other types of nodes will remain unchanged. This reduces
+the storage overhead which is precious for users of UBIFS (ie. embedded
+devices).
+
+
+ +---------------+
+ | Master Node |
+ | (hash) |
+ +---------------+
+ |
+ v
+ +-------------------+
+ | Index Node #1 |
+ | |
+ | branch0 branchn |
+ | (hash) (hash) |
+ +-------------------+
+ | ... | (fanout: 8)
+ | |
+ +-------+ +------+
+ | |
+ v v
+ +-------------------+ +-------------------+
+ | Index Node #2 | | Index Node #3 |
+ | | | |
+ | branch0 branchn | | branch0 branchn |
+ | (hash) (hash) | | (hash) (hash) |
+ +-------------------+ +-------------------+
+ | ... | ... |
+ v v v
+ +-----------+ +----------+ +-----------+
+ | Data Node | | INO Node | | DENT Node |
+ +-----------+ +----------+ +-----------+
+
+
+ Figure 3: Coverage areas of index node hash and master node HMAC
+
+
+
+The most important part for robustness and power-cut safety is to atomically
+persist the hash and file contents. Here the existing UBIFS logic for how
+changed nodes are persisted is already designed for this purpose such that
+UBIFS can safely recover if a power-cut occurs while persisting. Adding
+hashes to index nodes does not change this since each hash will be persisted
+atomically together with its respective node.
+
+
+### Journal Authentication
+
+The journal is authenticated too. Since the journal is continuously written
+it is necessary to also add authentication information frequently to the
+journal so that in case of a powercut not too much data can't be authenticated.
+This is done by creating a continuous hash beginning from the commit start node
+over the previous reference nodes, the current reference node, and the bud
+nodes. From time to time whenever it is suitable authentication nodes are added
+between the bud nodes. This new node type contains a HMAC over the current state
+of the hash chain. That way a journal can be authenticated up to the last
+authentication node. The tail of the journal which may not have a authentication
+node cannot be authenticated and is skipped during journal replay.
+
+We get this picture for journal authentication:
+
+ ,,,,,,,,
+ ,......,...........................................
+ ,. CS , hash1.----. hash2.----.
+ ,. | , . |hmac . |hmac
+ ,. v , . v . v
+ ,.REF#0,-> bud -> bud -> bud.-> auth -> bud -> bud.-> auth ...
+ ,..|...,...........................................
+ , | ,
+ , | ,,,,,,,,,,,,,,,
+ . | hash3,----.
+ , | , |hmac
+ , v , v
+ , REF#1 -> bud -> bud,-> auth ...
+ ,,,|,,,,,,,,,,,,,,,,,,
+ v
+ REF#2 -> ...
+ |
+ V
+ ...
+
+Since the hash also includes the reference nodes an attacker cannot reorder or
+skip any journal heads for replay. An attacker can only remove bud nodes or
+reference nodes from the end of the journal, effectively rewinding the
+filesystem at maximum back to the last commit.
+
+The location of the log area is stored in the master node. Since the master
+node is authenticated with a HMAC as described above, it is not possible to
+tamper with that without detection. The size of the log area is specified when
+the filesystem is created using `mkfs.ubifs` and stored in the superblock node.
+To avoid tampering with this and other values stored there, a HMAC is added to
+the superblock struct. The superblock node is stored in LEB 0 and is only
+modified on feature flag or similar changes, but never on file changes.
+
+
+### LPT Authentication
+
+The location of the LPT root node on the flash is stored in the UBIFS master
+node. Since the LPT is written and read atomically on every commit, there is
+no need to authenticate individual nodes of the tree. It suffices to
+protect the integrity of the full LPT by a simple hash stored in the master
+node. Since the master node itself is authenticated, the LPTs authenticity can
+be verified by verifying the authenticity of the master node and comparing the
+LTP hash stored there with the hash computed from the read on-flash LPT.
+
+
+## Key Management
+
+For simplicity, UBIFS authentication uses a single key to compute the HMACs
+of superblock, master, commit start and reference nodes. This key has to be
+available on creation of the filesystem (`mkfs.ubifs`) to authenticate the
+superblock node. Further, it has to be available on mount of the filesystem
+to verify authenticated nodes and generate new HMACs for changes.
+
+UBIFS authentication is intended to operate side-by-side with UBIFS encryption
+(fscrypt) to provide confidentiality and authenticity. Since UBIFS encryption
+has a different approach of encryption policies per directory, there can be
+multiple fscrypt master keys and there might be folders without encryption.
+UBIFS authentication on the other hand has an all-or-nothing approach in the
+sense that it either authenticates everything of the filesystem or nothing.
+Because of this and because UBIFS authentication should also be usable without
+encryption, it does not share the same master key with fscrypt, but manages
+a dedicated authentication key.
+
+The API for providing the authentication key has yet to be defined, but the
+key can eg. be provided by userspace through a keyring similar to the way it
+is currently done in fscrypt. It should however be noted that the current
+fscrypt approach has shown its flaws and the userspace API will eventually
+change [FSCRYPT-POLICY2].
+
+Nevertheless, it will be possible for a user to provide a single passphrase
+or key in userspace that covers UBIFS authentication and encryption. This can
+be solved by the corresponding userspace tools which derive a second key for
+authentication in addition to the derived fscrypt master key used for
+encryption.
+
+To be able to check if the proper key is available on mount, the UBIFS
+superblock node will additionally store a hash of the authentication key. This
+approach is similar to the approach proposed for fscrypt encryption policy v2
+[FSCRYPT-POLICY2].
+
+
+# Future Extensions
+
+In certain cases where a vendor wants to provide an authenticated filesystem
+image to customers, it should be possible to do so without sharing the secret
+UBIFS authentication key. Instead, in addition the each HMAC a digital
+signature could be stored where the vendor shares the public key alongside the
+filesystem image. In case this filesystem has to be modified afterwards,
+UBIFS can exchange all digital signatures with HMACs on first mount similar
+to the way the IMA/EVM subsystem deals with such situations. The HMAC key
+will then have to be provided beforehand in the normal way.
+
+
+# References
+
+[CRYPTSETUP2] http://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
+
+[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
+
+[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt
+
+[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt
+
+[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html
+
+[UBIFS-WP] http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt
index a0a61d2f389f..acc80442a3bb 100644
--- a/Documentation/filesystems/ubifs.txt
+++ b/Documentation/filesystems/ubifs.txt
@@ -91,6 +91,13 @@ chk_data_crc do not skip checking CRCs on data nodes
compr=none override default compressor and set it to "none"
compr=lzo override default compressor and set it to "lzo"
compr=zlib override default compressor and set it to "zlib"
+auth_key= specify the key used for authenticating the filesystem.
+ Passing this option makes authentication mandatory.
+ The passed key must be present in the kernel keyring
+ and must be of type 'logon'
+auth_hash_name= The hash algorithm used for authentication. Used for
+ both hashing and for creating HMACs. Typical values
+ include "sha256" or "sha512"
Quick usage instructions
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..8dc8e9c2913f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -883,8 +883,9 @@ struct file_operations {
unsigned (*mmap_capabilities)(struct file *);
#endif
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
- int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
- int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
+ loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
};
@@ -960,11 +961,18 @@ otherwise noted.
copy_file_range: called by the copy_file_range(2) system call.
- clone_file_range: called by the ioctl(2) system call for FICLONERANGE and
- FICLONE commands.
-
- dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
- command.
+ remap_file_range: called by the ioctl(2) system call for FICLONERANGE and
+ FICLONE and FIDEDUPERANGE commands to remap file ranges. An
+ implementation should remap len bytes at pos_in of the source file into
+ the dest file at pos_out. Implementations must handle callers passing
+ in len == 0; this means "remap to the end of the source file". The
+ return value should the number of bytes remapped, or the usual
+ negative error code if errors occurred before any bytes were remapped.
+ The remap_flags parameter accepts REMAP_FILE_* flags. If
+ REMAP_FILE_DEDUP is set then the implementation must only remap if the
+ requested file ranges have identical contents. If REMAP_CAN_SHORTEN is
+ set, the caller is ok with the implementation shortening the request
+ length to satisfy alignment or EOF requirements (or any other reason).
fadvise: possibly called by the fadvise64() system call.
@@ -1123,7 +1131,7 @@ struct dentry_operations {
d_manage: called to allow the filesystem to manage the transition from a
dentry (optional). This allows autofs, for example, to hold up clients
- waiting to explore behind a 'mountpoint' whilst letting the daemon go
+ waiting to explore behind a 'mountpoint' while letting the daemon go
past and construct the subtree there. 0 should be returned to let the
calling process continue. -EISDIR can be returned to tell pathwalk to
use this directory as an ordinary directory and to ignore anything
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt
index 05aa455163e3..68604e67a495 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.txt
+++ b/Documentation/filesystems/xfs-self-describing-metadata.txt
@@ -110,7 +110,7 @@ owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.
Different types of metadata have different owner identifiers. For example,
-directory, attribute and extent tree blocks are all owned by an inode, whilst
+directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at. The owner information can also identify misplaced writes (e.g.
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index a9ae82fb9d13..9ccfd1bc6201 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -417,7 +417,7 @@ level directory:
filesystem from ever unmounting fully in the case of "retry forever"
handler configurations.
- Note: there is no guarantee that fail_at_unmount can be set whilst an
+ Note: there is no guarantee that fail_at_unmount can be set while an
unmount is in progress. It is possible that the sysfs entries are
removed by the unmounting filesystem before a "retry forever" error
handler configuration causes unmount to hang, and hence the filesystem