6 files changed, 426 insertions, 13 deletions
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
new file mode 100644
index 000000000000..4804441155f5
--- /dev/null
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+NFSv4 client identifier
+=======================
+
+This document explains how the NFSv4 protocol identifies client
+instances in order to maintain file open and lock state during
+system restarts. A special identifier and principal are maintained
+on each client. These can be set by administrators, scripts
+provided by site administrators, or tools provided by Linux
+distributors.
+
+There are risks if a client's NFSv4 identifier and its principal
+are not chosen carefully.
+
+
+Introduction
+------------
+
+The NFSv4 protocol uses "lease-based file locking". Leases help
+NFSv4 servers provide file lock guarantees and manage their
+resources.
+
+Simply put, an NFSv4 server creates a lease for each NFSv4 client.
+The server collects each client's file open and lock state under
+the lease for that client.
+
+The client is responsible for periodically renewing its leases.
+While a lease remains valid, the server holding that lease
+guarantees the file locks the client has created remain in place.
+
+If a client stops renewing its lease (for example, if it crashes),
+the NFSv4 protocol allows the server to remove the client's open
+and lock state after a certain period of time. When a client
+restarts, it indicates to servers that open and lock state
+associated with its previous leases is no longer valid and can be
+destroyed immediately.
+
+In addition, each NFSv4 server manages a persistent list of client
+leases. When the server restarts and clients attempt to recover
+their state, the server uses this list to distinguish amongst
+clients that held state before the server restarted and clients
+sending fresh OPEN and LOCK requests. This enables file locks to
+persist safely across server restarts.
+
+NFSv4 client identifiers
+------------------------
+
+Each NFSv4 client presents an identifier to NFSv4 servers so that
+they can associate the client with its lease. Each client's
+identifier consists of two elements:
+
+  - co_ownerid: An arbitrary but fixed string.
+
+  - boot verifier: A 64-bit incarnation verifier that enables a
+    server to distinguish successive boot epochs of the same client.
+
+The NFSv4.0 specification refers to these two items as an
+"nfs_client_id4". The NFSv4.1 specification refers to these two
+items as a "client_owner4".
+
+NFSv4 servers tie this identifier to the principal and security
+flavor that the client used when presenting it. Servers use this
+principal to authorize subsequent lease modification operations
+sent by the client. Effectively this principal is a third element of
+the identifier.
+
+As part of the identity presented to servers, a good
+"co_ownerid" string has several important properties:
+
+  - The "co_ownerid" string identifies the client during reboot
+    recovery, therefore the string is persistent across client
+    reboots.
+  - The "co_ownerid" string helps servers distinguish the client
+    from others, therefore the string is globally unique. Note
+    that there is no central authority that assigns "co_ownerid"
+    strings.
+  - Because it often appears on the network in the clear, the
+    "co_ownerid" string does not reveal private information about
+    the client itself.
+  - The content of the "co_ownerid" string is set and unchanging
+    before the client attempts NFSv4 mounts after a restart.
+  - The NFSv4 protocol places a 1024-byte limit on the size of the
+    "co_ownerid" string.
+
+Protecting NFSv4 lease state
+----------------------------
+
+NFSv4 servers utilize the "client_owner4" as described above to
+assign a unique lease to each client. Under this scheme, there are
+circumstances where clients can interfere with each other. This is
+referred to as "lease stealing".
+
+If distinct clients present the same "co_ownerid" string and use
+the same principal (for example, AUTH_SYS and UID 0), a server is
+unable to tell that the clients are not the same. Each distinct
+client presents a different boot verifier, so it appears to the
+server as if there is one client that is rebooting frequently.
+Neither client can maintain open or lock state in this scenario.
+
+If distinct clients present the same "co_ownerid" string and use
+distinct principals, the server is likely to allow the first client
+to operate normally but reject subsequent clients with the same
+"co_ownerid" string.
+
+If a client's "co_ownerid" string or principal are not stable,
+state recovery after a server or client reboot is not guaranteed.
+If a client unexpectedly restarts but presents a different
+"co_ownerid" string or principal to the server, the server orphans
+the client's previous open and lock state. This blocks access to
+locked files until the server removes the orphaned state.
+
+If the server restarts and a client presents a changed "co_ownerid"
+string or principal to the server, the server will not allow the
+client to reclaim its open and lock state, and may give those locks
+to other clients in the meantime. This is referred to as "lock
+stealing".
+
+Lease stealing and lock stealing increase the potential for denial
+of service and in rare cases even data corruption.
+
+Selecting an appropriate client identifier
+------------------------------------------
+
+By default, the Linux NFSv4 client implementation constructs its
+"co_ownerid" string starting with the words "Linux NFS" followed by
+the client's UTS node name (the same node name, incidentally, that
+is used as the "machine name" in an AUTH_SYS credential). In small
+deployments, this construction is usually adequate. Often, however,
+the node name by itself is not adequately unique, and can change
+unexpectedly. Problematic situations include:
+
+  - NFS-root (diskless) clients, where the local DHCP server (or
+    equivalent) does not provide a unique host name.
+
+  - "Containers" within a single Linux host.  If each container has
+    a separate network namespace, but does not use the UTS namespace
+    to provide a unique host name, then there can be multiple NFS
+    client instances with the same host name.
+
+  - Clients across multiple administrative domains that access a
+    common NFS server. If hostnames are not assigned centrally
+    then uniqueness cannot be guaranteed unless a domain name is
+    included in the hostname.
+
+Linux provides two mechanisms to add uniqueness to its "co_ownerid"
+string:
+
+    nfs.nfs4_unique_id
+      This module parameter can set an arbitrary uniquifier string
+      via the kernel command line, or when the "nfs" module is
+      loaded.
+
+    /sys/fs/nfs/net/nfs_client/identifier
+      This virtual file, available since Linux 5.3, is local to the
+      network namespace in which it is accessed and so can provide
+      distinction between network namespaces (containers) when the
+      hostname remains uniform.
+
+Note that this file is empty on name-space creation. If the
+container system has access to some sort of per-container identity
+then that uniquifier can be used. For example, a uniquifier might
+be formed at boot using the container's internal identifier:
+
+    sha256sum /etc/machine-id | awk '{print $1}' \\
+        > /sys/fs/nfs/net/nfs_client/identifier
+
+Security considerations
+-----------------------
+
+The use of cryptographic security for lease management operations
+is strongly encouraged.
+
+If NFS with Kerberos is not configured, a Linux NFSv4 client uses
+AUTH_SYS and UID 0 as the principal part of its client identity.
+This configuration is not only insecure, it increases the risk of
+lease and lock stealing. However, it might be the only choice for
+client configurations that have no local persistent storage.
+"co_ownerid" string uniqueness and persistence is critical in this
+case.
+
+When a Kerberos keytab is present on a Linux NFS client, the client
+attempts to use one of the principals in that keytab when
+identifying itself to servers. The "sec=" mount option does not
+control this behavior. Alternately, a single-user client with a
+Kerberos principal can use that principal in place of the client's
+host principal.
+
+Using Kerberos for this purpose enables the client and server to
+use the same lease for operations covered by all "sec=" settings.
+Additionally, the Linux NFS client uses the RPCSEC_GSS security
+flavor with Kerberos and the integrity QOS to prevent in-transit
+modification of lease modification requests.
+
+Additional notes
+----------------
+The Linux NFSv4 client establishes a single lease on each NFSv4
+server it accesses. NFSv4 mounts from a Linux NFSv4 client of a
+particular server then share that lease.
+
+Once a client establishes open and lock state, the NFSv4 protocol
+enables lease state to transition to other servers, following data
+that has been migrated. This hides data migration completely from
+running applications. The Linux NFSv4 client facilitates state
+migration by presenting the same "client_owner4" to all servers it
+encounters.
+
+========
+See Also
+========
+
+  - nfs(5)
+  - kerberos(7)
+  - RFC 7530 for the NFSv4.0 specification
+  - RFC 8881 for the NFSv4.1 specification.
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 33d588a01ace..f04ce1215a03 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
 super_block.  This field must point to a "struct export_operations"
 struct which has the following members:
 
- encode_fh  (optional)
-    Takes a dentry and creates a filehandle fragment which can later be used
-    to find or create a dentry for the same object.  The default
-    implementation creates a filehandle fragment that encodes a 32bit inode
-    and generation number for the inode encoded, and if necessary the
-    same information for the parent.
+  encode_fh (mandatory)
+    Takes a dentry and creates a filehandle fragment which may later be used
+    to find or create a dentry for the same object.
 
   fh_to_dentry (mandatory)
     Given a filehandle fragment, this should find the implied object and
@@ -154,6 +151,11 @@ struct which has the following members:
     to find potential names, and matches inode numbers to find the correct
     match.
 
+  flags
+    Some filesystems may need to be handled differently than others. The
+    export_operations struct also includes a flags field that allows the
+    filesystem to communicate such information to nfsd. See the Export
+    Operations Flags section below for more explanation.
 
 A filehandle fragment consists of an array of 1 or more 4byte words,
 together with a one byte "type".
@@ -163,3 +165,83 @@ generated by encode_fh, in which case it will have been padded with
 nuls.  Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how
 it should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
+    RFC 1813 recommends that servers always send weak cache consistency
+    (WCC) data to the client after each operation. The server should
+    atomically collect attributes about the inode, do an operation on it,
+    and then collect the attributes afterward. This allows the client to
+    skip issuing GETATTRs in some situations but means that the server
+    is calling vfs_getattr for almost all RPCs. On some filesystems
+    (particularly those that are clustered or networked) this is expensive
+    and atomicity is difficult to guarantee. This flag indicates to nfsd
+    that it should skip providing WCC attributes to the client in NFSv3
+    replies when doing operations on this filesystem. Consider enabling
+    this on filesystems that have an expensive ->getattr inode operation,
+    or when atomicity between pre and post operation attribute collection
+    is impossible to guarantee.
+
+  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+    Many NFS operations deal with filehandles, which the server must then
+    vet to ensure that they live inside of an exported tree. When the
+    export consists of an entire filesystem, this is trivial. nfsd can just
+    ensure that the filehandle live on the filesystem. When only part of a
+    filesystem is exported however, then nfsd must walk the ancestors of the
+    inode to ensure that it's within an exported subtree. This is an
+    expensive operation and not all filesystems can support it properly.
+    This flag exempts the filesystem from subtree checking and causes
+    exportfs to get back an error if it tries to enable subtree checking
+    on it.
+
+  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+    On some exportable filesystems (such as NFS) unlinking a file that
+    is still open can cause a fair bit of extra work. For instance,
+    the NFS client will do a "sillyrename" to ensure that the file
+    sticks around while it's still open. When reexporting, that open
+    file is held by nfsd so we usually end up doing a sillyrename, and
+    then immediately deleting the sillyrenamed file just afterward when
+    the link count actually goes to zero. Sometimes this delete can race
+    with other operations (for instance an rmdir of the parent directory).
+    This flag causes nfsd to close any open files for this inode _before_
+    calling into the vfs to do an unlink or a rename that would replace
+    an existing file.
+
+  EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
+    PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
+    write to one bdi (the final bdi) in order to free up writes queued
+    to another bdi (the client bdi). Such threads get a private balance
+    of dirty pages so that dirty pages for the client bdi do not imact
+    the daemon writing to the final bdi. For filesystems whose durable
+    storage is not local (such as exported NFS filesystems), this
+    constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
+    an export to disable writeback throttling.
+
+  EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
+    EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
+    cannot provide the semantics required by the "atomic" boolean in
+    NFSv4's change_info4. This boolean indicates to a client whether the
+    returned before and after change attributes were obtained atomically
+    with the respect to the requested metadata operation (UNLINK,
+    OPEN/CREATE, MKDIR, etc).
+
+  EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
+    On most filesystems, inodes can remain under writeback after the
+    file is closed. NFSD relies on client activity or local flusher
+    threads to handle writeback. Certain filesystems, such as NFS, flush
+    all of an inode's dirty data on last close. Exports that behave this
+    way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
+    waiting for writeback when closing such files.
+
+  EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
+    requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
+    it's own ->lock() functionality as core posix_lock_file() implementation
+    has no async lock request handling yet. For more information about how to
+    indicate an async lock request from a ->lock() file_operations struct, see
+    fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 65805624e39b..8536134f31fd 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -6,8 +6,11 @@ NFS
 .. toctree::
    :maxdepth: 1
 
+   client-identifier
+   exporting
    pnfs
    rpc-cache
    rpc-server-gss
    nfs41-server
    knfsd-stats
+   reexport
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000000..ff9ae4a46530
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,113 @@
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS.  However, this
+feature comes with a number of limitations.  Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem.  You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export does not propagate "fsid=", so it will not allow
+traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid= option.
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots.  Clients will lose any locks
+they held before the reboot, and further IO will result in errors.
+Closing and reopening files should clear the errors.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 |  32 bytes |
++-------+-----------+
+| NFSv3 |  64 bytes |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples.  The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+|        | filehandle length | after reexport |
++========+===================+================+
+| ext4:  | 28 bytes          | 52 bytes       |
++--------+-------------------+----------------+
+| xfs:   | 32 bytes          | 56 bytes       |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes          | 64 bytes       |
++--------+-------------------+----------------+
+
+All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
+but none are reexportable over NFSv2.
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+        - The (non-default) "subtreecheck" export option generally
+          requires another 4 to 8 bytes in the filehandle.
+        - If you export a subdirectory of a filesystem (instead of
+          exporting the filesystem root), that also usually adds 4 to 8
+          bytes.
+        - If you export over NFSv2, knfsd usually uses a shorter
+          filesystem identifier that saves 8 bytes.
+        - The root directory of an export uses a filehandle that is
+          shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server.  In general, if the original server is
+something that also supports NFSv3, you're *probably* OK.  Re-exporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+        - include/linux/exportfs.h:enum fid_type
+        - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+        - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+        - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
+allow you, for example, to open a file in a mode which forbids other
+read opens or write opens. The Linux client doesn't use them, and the
+server's support has always been incomplete: they are enforced only
+against other NFS users, not against processes accessing the exported
+filesystem locally. A reexport server will also not pass them along to
+the original server, so they will not be enforced between clients of
+different reexport servers.
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
index bb164eea969b..339efd75016a 100644
--- a/Documentation/filesystems/nfs/rpc-cache.rst
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -78,7 +78,7 @@ Creating a Cache
       include taking references to shared objects.
 
     void update(struct cache_head \*orig, struct cache_head \*new)
-      Set the 'content' fileds in 'new' from 'orig'.
+      Set the 'content' fields in 'new' from 'orig'.
 
     int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
       Optional.  Used to provide a /proc file that lists the
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
index 812754576845..5c1a1c58fc27 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.rst
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -10,13 +10,12 @@ purposes of authentication.)
 
 RPCGSS is specified in a few IETF documents:
 
- - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
 
-and there is a 3rd version  being proposed:
+There is a third version that we don't currently implement:
 
- - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
-   (At draft n. 02 at the time of writing)
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
 
 Background
 ==========
@@ -30,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and
 depends on GSSAPI extensions that are KRB5 specific.
 
 GSSAPI is a complex library, and implementing it completely in kernel is
-unwarranted. However GSSAPI operations are fundementally separable in 2
+unwarranted. However GSSAPI operations are fundamentally separable in 2
 parts:
 
 - initial context establishment