Diffstat (limited to 'Documentation/filesystems/nfs')

 Documentation/filesystems/nfs/client-identifier.rst | 216
 Documentation/filesystems/nfs/exporting.rst         |  94
 Documentation/filesystems/nfs/index.rst             |   3
 Documentation/filesystems/nfs/reexport.rst          | 113
 Documentation/filesystems/nfs/rpc-cache.rst         |   2
 Documentation/filesystems/nfs/rpc-server-gss.rst    |  11

 6 files changed, 426 insertions(+), 13 deletions(-)
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
new file mode 100644
index 000000000000..4804441155f5
--- /dev/null
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+NFSv4 client identifier
+=======================
+
+This document explains how the NFSv4 protocol identifies client
+instances in order to maintain file open and lock state during
+system restarts. A special identifier and principal are maintained
+on each client. These can be set by administrators, scripts
+provided by site administrators, or tools provided by Linux
+distributors.
+
+There are risks if a client's NFSv4 identifier and its principal
+are not chosen carefully.
+
+
+Introduction
+------------
+
+The NFSv4 protocol uses "lease-based file locking". Leases help
+NFSv4 servers provide file lock guarantees and manage their
+resources.
+
+Simply put, an NFSv4 server creates a lease for each NFSv4 client.
+The server collects each client's file open and lock state under
+the lease for that client.
+
+The client is responsible for periodically renewing its leases.
+While a lease remains valid, the server holding that lease
+guarantees the file locks the client has created remain in place.
+
+If a client stops renewing its lease (for example, if it crashes),
+the NFSv4 protocol allows the server to remove the client's open
+and lock state after a certain period of time. When a client
+restarts, it indicates to servers that open and lock state
+associated with its previous leases is no longer valid and can be
+destroyed immediately.
+
+In addition, each NFSv4 server manages a persistent list of client
+leases. When the server restarts and clients attempt to recover
+their state, the server uses this list to distinguish amongst
+clients that held state before the server restarted and clients
+sending fresh OPEN and LOCK requests. This enables file locks to
+persist safely across server restarts.
+
+NFSv4 client identifiers
+------------------------
+
+Each NFSv4 client presents an identifier to NFSv4 servers so that
+they can associate the client with its lease. Each client's
+identifier consists of two elements:
+
+  - co_ownerid: An arbitrary but fixed string.
+
+  - boot verifier: A 64-bit incarnation verifier that enables a
+    server to distinguish successive boot epochs of the same client.
+
+The NFSv4.0 specification refers to these two items as an
+"nfs_client_id4". The NFSv4.1 specification refers to these two
+items as a "client_owner4".
+
+NFSv4 servers tie this identifier to the principal and security
+flavor that the client used when presenting it. Servers use this
+principal to authorize subsequent lease modification operations
+sent by the client. Effectively this principal is a third element of
+the identifier.
+
+As part of the identity presented to servers, a good
+"co_ownerid" string has several important properties:
+
+  - The "co_ownerid" string identifies the client during reboot
+    recovery, therefore the string is persistent across client
+    reboots.
+  - The "co_ownerid" string helps servers distinguish the client
+    from others, therefore the string is globally unique. Note
+    that there is no central authority that assigns "co_ownerid"
+    strings.
+  - Because it often appears on the network in the clear, the
+    "co_ownerid" string does not reveal private information about
+    the client itself.
+  - The content of the "co_ownerid" string is set and unchanging
+    before the client attempts NFSv4 mounts after a restart.
+  - The NFSv4 protocol places a 1024-byte limit on the size of the
+    "co_ownerid" string.
+
+Protecting NFSv4 lease state
+----------------------------
+
+NFSv4 servers utilize the "client_owner4" as described above to
+assign a unique lease to each client. Under this scheme, there are
+circumstances where clients can interfere with each other. This is
+referred to as "lease stealing".
+
+If distinct clients present the same "co_ownerid" string and use
+the same principal (for example, AUTH_SYS and UID 0), a server is
+unable to tell that the clients are not the same. Each distinct
+client presents a different boot verifier, so it appears to the
+server as if there is one client that is rebooting frequently.
+Neither client can maintain open or lock state in this scenario.
+
+If distinct clients present the same "co_ownerid" string and use
+distinct principals, the server is likely to allow the first client
+to operate normally but reject subsequent clients with the same
+"co_ownerid" string.
+
+If a client's "co_ownerid" string or principal are not stable,
+state recovery after a server or client reboot is not guaranteed.
+If a client unexpectedly restarts but presents a different
+"co_ownerid" string or principal to the server, the server orphans
+the client's previous open and lock state. This blocks access to
+locked files until the server removes the orphaned state.
+
+If the server restarts and a client presents a changed "co_ownerid"
+string or principal to the server, the server will not allow the
+client to reclaim its open and lock state, and may give those locks
+to other clients in the meantime. This is referred to as "lock
+stealing".
+
+Lease stealing and lock stealing increase the potential for denial
+of service and in rare cases even data corruption.
+
+Selecting an appropriate client identifier
+------------------------------------------
+
+By default, the Linux NFSv4 client implementation constructs its
+"co_ownerid" string starting with the words "Linux NFS" followed by
+the client's UTS node name (the same node name, incidentally, that
+is used as the "machine name" in an AUTH_SYS credential). In small
+deployments, this construction is usually adequate. Often, however,
+the node name by itself is not adequately unique, and can change
+unexpectedly. Problematic situations include:
+
+  - NFS-root (diskless) clients, where the local DHCP server (or
+    equivalent) does not provide a unique host name.
+
+  - "Containers" within a single Linux host. If each container has
+    a separate network namespace, but does not use the UTS namespace
+    to provide a unique host name, then there can be multiple NFS
+    client instances with the same host name.
+
+  - Clients across multiple administrative domains that access a
+    common NFS server. If hostnames are not assigned centrally
+    then uniqueness cannot be guaranteed unless a domain name is
+    included in the hostname.
+
+Linux provides two mechanisms to add uniqueness to its "co_ownerid"
+string:
+
+  nfs.nfs4_unique_id
+    This module parameter can set an arbitrary uniquifier string
+    via the kernel command line, or when the "nfs" module is
+    loaded.
+
+  /sys/fs/nfs/net/nfs_client/identifier
+    This virtual file, available since Linux 5.3, is local to the
+    network namespace in which it is accessed and so can provide
+    distinction between network namespaces (containers) when the
+    hostname remains uniform.
+
+Note that this file is empty on name-space creation. If the
+container system has access to some sort of per-container identity
+then that uniquifier can be used. For example, a uniquifier might
+be formed at boot using the container's internal identifier:
+
+  sha256sum /etc/machine-id | awk '{print $1}' \\
+    > /sys/fs/nfs/net/nfs_client/identifier
+
+Security considerations
+-----------------------
+
+The use of cryptographic security for lease management operations
+is strongly encouraged.
+
+If NFS with Kerberos is not configured, a Linux NFSv4 client uses
+AUTH_SYS and UID 0 as the principal part of its client identity.
+This configuration is not only insecure, it increases the risk of
+lease and lock stealing. However, it might be the only choice for
+client configurations that have no local persistent storage.
+"co_ownerid" string uniqueness and persistence is critical in this
+case.
+
+When a Kerberos keytab is present on a Linux NFS client, the client
+attempts to use one of the principals in that keytab when
+identifying itself to servers. The "sec=" mount option does not
+control this behavior. Alternately, a single-user client with a
+Kerberos principal can use that principal in place of the client's
+host principal.
+
+Using Kerberos for this purpose enables the client and server to
+use the same lease for operations covered by all "sec=" settings.
+Additionally, the Linux NFS client uses the RPCSEC_GSS security
+flavor with Kerberos and the integrity QOS to prevent in-transit
+modification of lease modification requests.
+
+Additional notes
+----------------
+The Linux NFSv4 client establishes a single lease on each NFSv4
+server it accesses. NFSv4 mounts from a Linux NFSv4 client of a
+particular server then share that lease.
+
+Once a client establishes open and lock state, the NFSv4 protocol
+enables lease state to transition to other servers, following data
+that has been migrated. This hides data migration completely from
+running applications. The Linux NFSv4 client facilitates state
+migration by presenting the same "client_owner4" to all servers it
+encounters.
+
+========
+See Also
+========
+
+  - nfs(5)
+  - kerberos(7)
+  - RFC 7530 for the NFSv4.0 specification
+  - RFC 8881 for the NFSv4.1 specification.
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 33d588a01ace..f04ce1215a03 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
 super_block. This field must point to a "struct export_operations"
 struct which has the following members:
 
-  encode_fh (optional)
-    Takes a dentry and creates a filehandle fragment which can later be used
-    to find or create a dentry for the same object. The default
-    implementation creates a filehandle fragment that encodes a 32bit inode
-    and generation number for the inode encoded, and if necessary the
-    same information for the parent.
+  encode_fh (mandatory)
+    Takes a dentry and creates a filehandle fragment which may later be used
+    to find or create a dentry for the same object.
 
   fh_to_dentry (mandatory)
     Given a filehandle fragment, this should find the implied object and
@@ -154,6 +151,11 @@ struct which has the following members:
     to find potential names, and matches inode numbers to find the correct
     match.
 
+  flags
+    Some filesystems may need to be handled differently than others. The
+    export_operations struct also includes a flags field that allows the
+    filesystem to communicate such information to nfsd. See the Export
+    Operations Flags section below for more explanation.
 
 A filehandle fragment consists of an array of 1 or more 4byte words,
 together with a one byte "type".
@@ -163,3 +165,83 @@ generated by encode_fh, in which case it will have been
 padded with nuls. Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how it
 should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
+    RFC 1813 recommends that servers always send weak cache consistency
+    (WCC) data to the client after each operation. The server should
+    atomically collect attributes about the inode, do an operation on it,
+    and then collect the attributes afterward. This allows the client to
+    skip issuing GETATTRs in some situations but means that the server
+    is calling vfs_getattr for almost all RPCs. On some filesystems
+    (particularly those that are clustered or networked) this is expensive
+    and atomicity is difficult to guarantee. This flag indicates to nfsd
+    that it should skip providing WCC attributes to the client in NFSv3
+    replies when doing operations on this filesystem. Consider enabling
+    this on filesystems that have an expensive ->getattr inode operation,
+    or when atomicity between pre and post operation attribute collection
+    is impossible to guarantee.
+
+  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+    Many NFS operations deal with filehandles, which the server must then
+    vet to ensure that they live inside of an exported tree. When the
+    export consists of an entire filesystem, this is trivial. nfsd can just
+    ensure that the filehandle lives on the filesystem. When only part of a
+    filesystem is exported however, then nfsd must walk the ancestors of the
+    inode to ensure that it's within an exported subtree. This is an
+    expensive operation and not all filesystems can support it properly.
+    This flag exempts the filesystem from subtree checking and causes
+    exportfs to get back an error if it tries to enable subtree checking
+    on it.
+
+  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+    On some exportable filesystems (such as NFS) unlinking a file that
+    is still open can cause a fair bit of extra work. For instance,
+    the NFS client will do a "sillyrename" to ensure that the file
+    sticks around while it's still open. When reexporting, that open
+    file is held by nfsd so we usually end up doing a sillyrename, and
+    then immediately deleting the sillyrenamed file just afterward when
+    the link count actually goes to zero. Sometimes this delete can race
+    with other operations (for instance an rmdir of the parent directory).
+    This flag causes nfsd to close any open files for this inode _before_
+    calling into the vfs to do an unlink or a rename that would replace
+    an existing file.
+
+  EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
+    PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
+    write to one bdi (the final bdi) in order to free up writes queued
+    to another bdi (the client bdi).
+    Such threads get a private balance
+    of dirty pages so that dirty pages for the client bdi do not impact
+    the daemon writing to the final bdi. For filesystems whose durable
+    storage is not local (such as exported NFS filesystems), this
+    constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
+    an export to disable writeback throttling.
+
+  EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
+    EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
+    cannot provide the semantics required by the "atomic" boolean in
+    NFSv4's change_info4. This boolean indicates to a client whether the
+    returned before and after change attributes were obtained atomically
+    with respect to the requested metadata operation (UNLINK,
+    OPEN/CREATE, MKDIR, etc).
+
+  EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
+    On most filesystems, inodes can remain under writeback after the
+    file is closed. NFSD relies on client activity or local flusher
+    threads to handle writeback. Certain filesystems, such as NFS, flush
+    all of an inode's dirty data on last close. Exports that behave this
+    way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
+    waiting for writeback when closing such files.
+
+  EXPORT_OP_ASYNC_LOCK - Indicates that the filesystem can handle async lock
+    requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem
+    has its own ->lock() functionality, as the core posix_lock_file()
+    implementation has no async lock request handling yet. For more
+    information about how to indicate an async lock request from a
+    ->lock() file_operations struct, see fs/locks.c and the comment for
+    the function vfs_lock_file().
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 65805624e39b..8536134f31fd 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -6,8 +6,11 @@ NFS
 .. toctree::
    :maxdepth: 1
 
+   client-identifier
+   exporting
    pnfs
    rpc-cache
    rpc-server-gss
    nfs41-server
    knfsd-stats
+   reexport
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000000..ff9ae4a46530
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,113 @@
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS. However, this
+feature comes with a number of limitations. Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem. You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export option does not propagate "fsid=", so it will not
+allow traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid=" option.
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots. Clients will lose any locks
+they held before the reboot, and further IO will result in errors.
+Closing and reopening files should clear the errors.
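To make the "fsid=" requirement above concrete, here is a hypothetical reexport configuration; the export path and UUID are invented, and only the "uuidgen -r" command and the "fsid=" option itself come from the text above::

    # On the reexporting server, generate a unique fsid for the NFS mount
    # that will be reexported (all paths here are hypothetical):
    uuidgen -r
    # -> 2f1e9c6a-5d3b-4a6e-9b0e-8c2d4f7a1b3c
    #
    # /etc/exports entry using that UUID; no_subtree_check is used because
    # subtree checking cannot be enabled on reexported NFS filesystems:
    #
    #   /srv/reexport  *(rw,no_subtree_check,fsid=2f1e9c6a-5d3b-4a6e-9b0e-8c2d4f7a1b3c)

    exportfs -ra    # re-read /etc/exports and apply the change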
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 | 32 bytes  |
++-------+-----------+
+| NFSv3 | 64 bytes  |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples. The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+|        | filehandle length | after reexport |
++========+===================+================+
+| ext4:  | 28 bytes          | 52 bytes       |
++--------+-------------------+----------------+
+| xfs:   | 32 bytes          | 56 bytes       |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes          | 64 bytes       |
++--------+-------------------+----------------+
+
+All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
+but none are reexportable over NFSv2.
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+  - The (non-default) "subtree_check" export option generally
+    requires another 4 to 8 bytes in the filehandle.
+  - If you export a subdirectory of a filesystem (instead of
+    exporting the filesystem root), that also usually adds 4 to 8
+    bytes.
+  - If you export over NFSv2, knfsd usually uses a shorter
+    filesystem identifier that saves 8 bytes.
+  - The root directory of an export uses a filehandle that is
+    shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server. In general, if the original server is
+something that also supports NFSv3, you're *probably* OK. Reexporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+  - include/linux/exportfs.h:enum fid_type
+  - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+  - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+  - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+Since NFSv4, the protocol supports ALLOW and DENY bits taken from
+Windows, which allow you, for example, to open a file in a mode which
+forbids other read opens or write opens. The Linux client doesn't use
+them, and the server's support has always been incomplete: they are
+enforced only against other NFS users, not against processes accessing
+the exported filesystem locally. A reexport server will also not pass
+them along to the original server, so they will not be enforced between
+clients of different reexport servers.
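To sanity-check the arithmetic in the "Filehandle limits" section above, here is a small, illustrative shell sketch of the X+22 rule; the helper name is made up, and the sizes fed to it are the examples from the table::

    # Reexported filehandle size: original size plus 22 bytes, rounded up
    # to the nearest multiple of four (hypothetical helper for illustration).
    reexported_fh_size() {
        echo $(( ($1 + 22 + 3) / 4 * 4 ))
    }

    reexported_fh_size 28    # ext4  -> 52
    reexported_fh_size 32    # xfs   -> 56
    reexported_fh_size 40    # btrfs -> 64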
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
index bb164eea969b..339efd75016a 100644
--- a/Documentation/filesystems/nfs/rpc-cache.rst
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -78,7 +78,7 @@ Creating a Cache
        include taking references to shared objects.
 
      void update(struct cache_head \*orig, struct cache_head \*new)
-       Set the 'content' fileds in 'new' from 'orig'.
+       Set the 'content' fields in 'new' from 'orig'.
 
      int cache_show(struct seq_file \*m, struct cache_detail \*cd,
                     struct cache_head \*h)
        Optional. Used to provide a /proc file that lists the
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
index 812754576845..5c1a1c58fc27 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.rst
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -10,13 +10,12 @@ purposes of authentication.)
 
 RPCGSS is specified in a few IETF documents:
 
- - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
 
-and there is a 3rd version being proposed:
+There is a third version that we don't currently implement:
 
- - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
-   (At draft n. 02 at the time of writing)
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
 
 Background
 ==========
@@ -30,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and
 depends on GSSAPI extensions that are KRB5 specific.
 
 GSSAPI is a complex library, and implementing it completely in kernel is
-unwarranted. However GSSAPI operations are fundementally separable in 2
+unwarranted. However GSSAPI operations are fundamentally separable in 2
 parts:
 
 - initial context establishment
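Finally, to tie the Kerberos guidance in the client-identifier document above to practice: a client that has a keytab can request an integrity-protected security flavor with the standard "sec=" mount option. This is only an illustration; the server name and export path below are made up::

    # Hypothetical NFSv4 mount using Kerberos with integrity protection
    # (RPCSEC_GSS with the krb5i flavor); requires a configured KDC and a
    # keytab on the client -- see kerberos(7) and nfs(5).
    mount -t nfs4 -o sec=krb5i server.example.com:/export /mnt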