path: root/net/sunrpc
Age  Commit message  Author
2020-05-20  SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()")  Chuck Lever
commit 0a8e7b7d08466b5fc52f8e96070acc116d82a8bb upstream. I've noticed that when krb5i or krb5p security is in use, retransmitted requests are missing the server's duplicate reply cache. The computed checksum on the retransmitted request does not match the cached checksum, resulting in the server performing the retransmitted request again instead of returning the cached reply. The assumptions made when removing xdr_buf_trim() were not correct. In the send paths, the upper layer has already set the segment lengths correctly, and shorting the buffer's content is simply a matter of reducing buf->len. xdr_buf_trim() is the right answer in the receive/unwrap path on both the client and the server. The buffer segment lengths have to be shortened one-by-one. On the server side in particular, head.iov_len needs to be updated correctly to enable nfsd_cache_csum() to work correctly. The simple buf->len computation doesn't do that, and that results in checksumming stale data in the buffer. The problem isn't noticed until there's significant instability of the RPC transport. At that point, the reliability of retransmit detection on the server becomes crucial. Fixes: 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
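The difference between reducing only buf->len and shortening the segments one by one can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel's xdr_buf_trim(): struct model_xdr_buf, take() and model_buf_trim() are hypothetical names, and the real struct xdr_buf carries iovecs and a page array rather than bare lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct xdr_buf. */
struct model_xdr_buf {
	size_t head_len;	/* models head[0].iov_len */
	size_t page_len;	/* models page_len */
	size_t tail_len;	/* models tail[0].iov_len */
	size_t len;		/* models buf->len, the total message length */
};

/* Remove up to @want bytes from one segment; return how many were removed. */
static size_t take(size_t *seg, size_t want)
{
	size_t cut = *seg < want ? *seg : want;

	*seg -= cut;
	return cut;
}

/*
 * Trim @trim bytes from the end of the buffer: tail first, then pages,
 * then head. Each segment length is updated, so a later consumer that
 * looks at head_len (as a checksum routine might) sees the short length.
 */
static void model_buf_trim(struct model_xdr_buf *buf, size_t trim)
{
	size_t left = trim;

	left -= take(&buf->tail_len, left);
	left -= take(&buf->page_len, left);
	left -= take(&buf->head_len, left);

	assert(buf->len >= trim - left);
	buf->len -= trim - left;
}
```

With a buffer of head 100, no pages and no tail, trimming 20 bytes leaves head_len at 80 as well as len at 80. A bare "buf->len -= 20" would leave head_len at 100, which is the stale-data case the commit message describes.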
2020-05-20  SUNRPC: Signalled ASYNC tasks need to exit  Chuck Lever
[ Upstream commit ce99aa62e1eb793e259d023c7f6ccb7c4879917b ] Ensure that signalled ASYNC rpc_tasks exit immediately instead of spinning until a timeout (or forever). To avoid checking for the signal flag on every scheduler iteration, the check is instead introduced in the client's finite state machine. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Fixes: ae67bd3821bb ("SUNRPC: Fix up task signalling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20  SUNRPC: Fix GSS privacy computation of auth->au_ralign  Chuck Lever
[ Upstream commit a7e429a6fa6d612d1dacde96c885dc1bb4a9f400 ] When the au_ralign field was added to gss_unwrap_resp_priv, the wrong calculation was used. Setting au_rslack == au_ralign is probably correct for kerberos_v1 privacy, but kerberos_v2 privacy adds additional GSS data after the clear text RPC message. au_ralign needs to be smaller than au_rslack in that fairly common case. When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does exactly what I feared it would: it trims off part of the clear text RPC message. However, that's because rpc_prepare_reply_pages() does not set up the rq_rcv_buf's tail correctly because au_ralign is too large. Fixing the au_ralign computation also corrects the alignment of rq_rcv_buf->pages so that the client does not have to shift reply data payloads after they are received. Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20  SUNRPC: Add "@len" parameter to gss_unwrap()  Chuck Lever
[ Upstream commit 31c9590ae468478fe47dc0f5f0d3562b2f69450e ] Refactor: This is a pre-requisite to fixing the client-side ralign computation in gss_unwrap_resp_priv(). The length value is passed in explicitly rather than as the value of buf->len. This will subsequently allow gss_unwrap_kerberos_v1() to compute a slack and align value, instead of computing it in gss_unwrap_resp_priv(). Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-20  xprtrdma: Fix trace point use-after-free race  Chuck Lever
[ Upstream commit bdb2ce82818577ba6e57b7d68b698b8d17329281 ] It's not safe to use resources pointed to by the @send_wr of ib_post_send() _after_ that function returns. Those resources are typically freed by the Send completion handler, which can run before ib_post_send() returns. Thus the trace points currently around ib_post_send() in the client's RPC/RDMA transport are a hazard, even when they are disabled. Rearrange them so that they touch the Work Request only _before_ ib_post_send() is invoked. Fixes: ab03eff58eb5 ("xprtrdma: Add trace points in RPC Call transmit paths") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
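The ordering rule in this message can be illustrated with a small userspace model. It is only a sketch under assumptions: struct model_wr, model_post_send() and model_trace() are hypothetical stand-ins, and the "completion handler" is simulated by freeing the request inside the post call, which is the worst case the commit describes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical work request; only one field is needed for the sketch. */
struct model_wr {
	int num_sge;
};

/*
 * Hypothetical stand-in for a post call. It frees the request before
 * returning, modelling a completion handler that runs first.
 */
static int model_post_send(struct model_wr *wr)
{
	free(wr);
	return 0;
}

/* Hypothetical trace hook: records the last value it was given. */
static int last_traced_sge = -1;

static void model_trace(int num_sge)
{
	last_traced_sge = num_sge;
}

/*
 * Safe ordering: every read of @wr happens before the post call.
 * After model_post_send() returns, @wr must be treated as gone.
 */
static int post_with_trace(struct model_wr *wr)
{
	assert(wr != NULL);
	model_trace(wr->num_sge);
	return model_post_send(wr);
}
```

Moving the model_trace() call below model_post_send() would read freed memory, which is the hazard the rearranged trace points avoid.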
2020-05-20  xprtrdma: Clean up the post_send path  Chuck Lever
[ Upstream commit 97d0de8812a10a66510ff95f8fe6e8d3053fd2ca ] Clean up: Simplify the synopses of functions in the post_send path by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-10  SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge  Yihao Wu
[ Upstream commit 43e33924c38e8faeb0c12035481cb150e602e39d ] Deleting a list entry within hlist_for_each_entry_safe is not safe unless the next pointer (tmp) is protected too. It's not, because once hash_lock is released, cache_clean may delete the entry that tmp points to. Then cache_purge can walk to a deleted entry and try to double free it. Fix this bug by holding only the deleted entry's reference. Suggested-by: NeilBrown <neilb@suse.de> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NeilBrown <neilb@suse.de> [ cel: removed unused variable ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02  svcrdma: Fix leak of svc_rdma_recv_ctxt objects  Chuck Lever
commit 23cf1ee1f1869966b75518c59b5cbda4c6c92450 upstream. Utilize the xpo_release_rqst transport method to ensure that each rqstp's svc_rdma_recv_ctxt object is released even when the server cannot return a Reply for that rqstp. Without this fix, each RPC whose Reply cannot be sent leaks one svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped Receive buffer, and any pages that might be part of the Reply message. The leak is infrequent unless the network fabric is unreliable or Kerberos is in use, as GSS sequence window overruns, which result in connection loss, are more common on fast transports. Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02  svcrdma: Fix trace point use-after-free race  Chuck Lever
commit e28b4fc652c1830796a4d3e09565f30c20f9a2cf upstream. I hit this while testing nfsd-5.7 with kernel memory debugging enabled on my server: Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060 Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591 Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma] Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 <8b> 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00 Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286 Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030 Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246 Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004 Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80 Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000 Mar 30 13:21:45 klimt kernel: FS: 0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000 Mar 30 13:21:45 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0 Mar 30 13:21:45 klimt kernel: Call Trace: Mar 30 13:21:45 klimt kernel: ? 
svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma] Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd] Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc] Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd] Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74 Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50 Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]--- It's absolutely not safe to use resources pointed to by the @send_wr argument of ib_post_send() _after_ that function returns. Those resources are typically freed by the Send completion handler, which can run before ib_post_send() returns. Thus the trace points currently around ib_post_send() in the server's RPC/RDMA transport are a hazard, even when they are disabled. Rearrange them so that they touch the Work Request only _before_ ib_post_send() is invoked. 
Fixes: bd2abef33394 ("svcrdma: Trace key RDMA API events") Fixes: 4201c7464753 ("svcrdma: Introduce svc_rdma_send_ctxt") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29  SUNRPC: Fix backchannel RPC soft lockups  Chuck Lever
commit 6221f1d9b63fed6260273e59a2b89ab30537a811 upstream. Currently, after the forward channel connection goes away, backchannel operations are causing soft lockups on the server because call_transmit_status's SOFTCONN logic ignores ENOTCONN. Such backchannel Calls are aggressively retried until the client reconnects. Backchannel Calls should use RPC_TASK_NOCONNECT rather than RPC_TASK_SOFTCONN. If there is no forward connection, the server is not capable of establishing a connection back to the client, thus that backchannel request should fail before the server attempts to send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN") was merged several years before RPC_TASK_NOCONNECT was available. Because setup_callback_client() explicitly sets NOPING, the NFSv4.0 callback connection depends on the first callback RPC to initiate a connection to the client. Thus NFSv4.0 needs to continue to use RPC_TASK_SOFTCONN. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # v4.20+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23  sunrpc: Fix gss_unwrap_resp_integ() again  Chuck Lever
[ Upstream commit 4047aa909c4a40fceebc36fff708d465a4d3c6e2 ]

xdr_buf_read_mic() tries to find unused contiguous space in a received xdr_buf in order to linearize the checksum for the call to gss_verify_mic. However, the corner cases in this code are numerous and we seem to keep missing them. I've just hit yet another buffer overrun related to it.

This overrun is at the end of xdr_buf_read_mic():

	if (buf->tail[0].iov_len != 0)
		mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
	else
		mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
	__read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
	return 0;

This logic assumes the transport has set the length of the tail based on the size of the received message. base + len is then supposed to be off the end of the message but still within the actual buffer.

In fact, the length of the tail is set by the upper layer when the Call is encoded so that the end of the tail is actually the end of the allocated buffer itself. This causes the logic above to set mic->data to point past the end of the receive buffer.

The "mic->data = head" arm of this if statement is no less fragile.

As near as I can tell, this has been a problem forever. I'm not sure that minimizing au_rslack recently changed this pathology much.

So instead, let's use a more straightforward approach: kmalloc a separate buffer to linearize the checksum. This is similar to how gss_validate() currently works.

Coming back to this code, I had some trouble understanding what was going on. So I've cleaned up the variable naming and added a few comments that point back to the XDR definition in RFC 2203 to help guide future spelunkers, including myself.

As an added clean up, the functionality that was in xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(), as that is its only caller.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
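The "separate buffer" approach can be sketched in userspace C. This is an illustration under assumptions, not the code in gss_unwrap_resp_integ(): struct model_seg and model_linearize() are hypothetical, malloc() stands in for kmalloc(), and a plain array of segments stands in for the head/pages/tail layout of an xdr_buf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical segment of a scattered receive buffer. */
struct model_seg {
	const unsigned char *base;
	size_t len;
};

/*
 * Copy @len bytes starting at logical offset @off, possibly spanning
 * several segments, into a freshly allocated contiguous buffer.
 * Returns NULL if the range lies outside the data or allocation fails.
 * The caller frees the result.
 */
static unsigned char *model_linearize(const struct model_seg *segs,
				      size_t nsegs, size_t off, size_t len)
{
	unsigned char *out, *dst;
	size_t i, total = 0;

	for (i = 0; i < nsegs; i++)
		total += segs[i].len;
	if (off > total || len > total - off)
		return NULL;	/* never read past the received data */

	out = malloc(len ? len : 1);
	if (!out)
		return NULL;

	dst = out;
	for (i = 0; i < nsegs && len > 0; i++) {
		size_t n;

		if (off >= segs[i].len) {
			off -= segs[i].len;	/* range starts in a later segment */
			continue;
		}
		n = segs[i].len - off;
		if (n > len)
			n = len;
		memcpy(dst, segs[i].base + off, n);
		dst += n;
		len -= n;
		off = 0;
	}
	return out;
}
```

Because the destination is its own allocation, its bounds do not depend on how the transport or the upper layer set the segment lengths, which is the property the old in-place approach lacked.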
2020-04-23  SUNRPC: fix krb5p mount to provide large enough buffer in rq_rcvsize  Olga Kornievskaia
[ Upstream commit df513a7711712758b9cb1a48d86712e7e1ee03f4 ]

Commit 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size") changed how "req->rq_rcvsize" is calculated. It used to use the au_cslack value, which was nice and large, and changed it to the au_rslack value, which turns out to be too small.

Since 5.1, v3 mount with sec=krb5p fails against an Ontap server because the client's receive buffer is too small.

For gss krb5p, we need to account for the mic token in the verifier, and the wrap token in the wrap token.

RFC 4121 defines:

mic token

Octet no   Name        Description
--------------------------------------------------------------
0..1       TOK_ID      Identification field. Tokens emitted by
                       GSS_GetMIC() contain the hex value 04 04
                       expressed in big-endian order in this field.
2          Flags       Attributes field, as described in section 4.2.2.
3..7       Filler      Contains five octets of hex value FF.
8..15      SND_SEQ     Sequence number field in clear text,
                       expressed in big-endian order.
16..last   SGN_CKSUM   Checksum of the "to-be-signed" data and
                       octet 0..15, as described in section 4.2.4.

that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum

wrap token

Octet no   Name        Description
--------------------------------------------------------------
0..1       TOK_ID      Identification field. Tokens emitted by
                       GSS_Wrap() contain the hex value 05 04
                       expressed in big-endian order in this field.
2          Flags       Attributes field, as described in section 4.2.2.
3          Filler      Contains the hex value FF.
4..5       EC          Contains the "extra count" field, in big-endian
                       order as described in section 4.2.3.
6..7       RRC         Contains the "right rotation count" in big-endian
                       order, as described in section 4.2.5.
8..15      SND_SEQ     Sequence number field in clear text,
                       expressed in big-endian order.
16..last   Data        Encrypted data for Wrap tokens with
                       confidentiality, or plaintext data followed
                       by the checksum for Wrap tokens without
                       confidentiality, as described in section 4.2.4.

Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum (other things like padding)

RFC 3961 defines known cksum sizes:

Checksum type          sumtype   checksum   section or
                       value     size       reference
---------------------------------------------------------------------
CRC32                  1         4          6.1.3
rsa-md4                2         16         6.1.2
rsa-md4-des            3         24         6.2.5
des-mac                4         16         6.2.7
des-mac-k              5         8          6.2.8
rsa-md4-des-k          6         16         6.2.6
rsa-md5                7         16         6.1.1
rsa-md5-des            8         24         6.2.4
rsa-md5-des3           9         24         ??
sha1 (unkeyed)         10        20         ??
hmac-sha1-des3-kd      12        20         6.3
hmac-sha1-des3         13        20         ??
sha1 (unkeyed)         14        20         ??
hmac-sha1-96-aes128    15        20         [KRB5-AES]
hmac-sha1-96-aes256    16        20         [KRB5-AES]
[reserved]             0x8003    ?          [GSS-KRB5]

Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes. (GSS_KRB5_MAX_CKSUM_LEN)

Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used for encoding the gss_wrap tokens (same tokens are used in reply).

Fixes: 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
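The size arithmetic in this message can be written out as a short C sketch. Only the two values stated above are used: a 16-byte token header and a 20-byte maximum checksum. The MODEL_ constants and both helper functions are hypothetical names chosen for illustration, and the second helper deliberately ignores confounder and padding bytes, so it is a floor and not the value of GSS_KRB5_MAX_SLACK_NEEDED.

```c
#include <assert.h>
#include <stddef.h>

enum {
	MODEL_TOK_HDR_LEN = 16,		/* octets 0..15 of a MIC or Wrap token */
	MODEL_MAX_CKSUM_LEN = 20,	/* largest checksum for sumtypes 15 and 16 */
};

/* Upper bound on a MIC token: fixed header followed by the checksum. */
static size_t model_mic_token_max(void)
{
	return MODEL_TOK_HDR_LEN + MODEL_MAX_CKSUM_LEN;
}

/*
 * Floor on the extra reply bytes for krb5p: one MIC token in the
 * verifier plus a Wrap token header and checksum around the payload.
 * Confounder and padding are intentionally left out of this sketch.
 */
static size_t model_reply_overhead_floor(void)
{
	size_t wrap = MODEL_TOK_HDR_LEN + MODEL_MAX_CKSUM_LEN;

	assert(wrap >= MODEL_TOK_HDR_LEN);
	return model_mic_token_max() + wrap;
}
```

Under these assumptions the MIC token is at most 36 bytes and the floor is 72 bytes, which gives a feel for why a receive-size estimate that omits the tokens comes up short.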
2020-02-13  xprtrdma: Fix DMA scatter-gather list mapping imbalance  Chuck Lever
The @nents value that was passed to ib_dma_map_sg() has to be passed to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() chooses to concatenate sg entries, it will return a different nents value than it was passed. The bug was exposed by recent changes to the AMD IOMMU driver, which enabled sg entry concatenation. Looking all the way back to commit 4143f34e01e9 ("xprtrdma: Port to new memory registration API") and reviewing other kernel ULPs, it's not clear that the frwr_map() logic was ever correct for this case. Reported-by: Andre Tomt <andre@tomt.net> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
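The map/unmap pairing rule can be modelled in userspace C. This is a sketch under assumptions: model_map_sg() and model_unmap_sg() are hypothetical stand-ins for the DMA mapping calls, the "concatenation" simply halves the entry count, and struct model_mr with its two fields is invented for the example.

```c
#include <assert.h>

/* Entries currently mapped, as seen by the hypothetical mapping layer. */
static int model_outstanding;

/*
 * Hypothetical map call: accounts for @nents entries and pretends that
 * adjacent entries were concatenated, so it returns a smaller count.
 */
static int model_map_sg(int nents)
{
	assert(nents > 0);
	model_outstanding += nents;
	return (nents + 1) / 2;
}

/* Hypothetical unmap call: must be given the count passed to map. */
static void model_unmap_sg(int nents)
{
	model_outstanding -= nents;
}

struct model_mr {
	int mr_nents;	/* count handed to the map call; reused for unmap */
	int mr_mapped;	/* count returned by the map call; used for I/O setup */
};

static void model_mr_map(struct model_mr *mr, int nents)
{
	mr->mr_nents = nents;
	mr->mr_mapped = model_map_sg(nents);
}

static void model_mr_unmap(struct model_mr *mr)
{
	/* Passing mr_mapped here would leave the accounting unbalanced. */
	model_unmap_sg(mr->mr_nents);
}
```

Keeping the two counts in separate fields makes the imbalance hard to reintroduce: the returned count describes what the device sees, while the original count is what the unmap call has to be told.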
2020-02-07  Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux  Linus Torvalds
Pull nfsd updates from Bruce Fields:

"Highlights:
 - Server-to-server copy code from Olga. To use it, client and both servers must have support, the target server must be able to access the source server over NFSv4.2, and the target server must have the inter_copy_offload_enable module parameter set.
 - Improvements and bugfixes for the new filehandle cache, especially in the container case, from Trond
 - Also from Trond, better reporting of write errors.
 - Y2038 work from Arnd"

* tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
  sunrpc: expiry_time should be seconds not timeval
  nfsd: make nfsd_filecache_wq variable static
  nfsd4: fix double free in nfsd4_do_async_copy()
  nfsd: convert file cache to use over/underflow safe refcount
  nfsd: Define the file access mode enum for tracing
  nfsd: Fix a perf warning
  nfsd: Ensure sampling of the write verifier is atomic with the write
  nfsd: Ensure sampling of the commit verifier is atomic with the commit
  sunrpc: clean up cache entry add/remove from hashtable
  sunrpc: Fix potential leaks in sunrpc_cache_unhash()
  nfsd: Ensure exclusion between CLONE and WRITE errors
  nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
  nfsd: Update the boot verifier on stable writes too.
  nfsd: Fix stable writes
  nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
  nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
  nfsd: Reduce the number of calls to nfsd_file_gc()
  nfsd: Schedule the laundrette regularly irrespective of file errors
  nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
  nfsd: Containerise filecache laundrette
  ...
2020-02-07  Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs  Linus Torvalds
Pull NFS client updates from Anna Schumaker:

"Stable bugfixes:
 - Fix memory leaks and corruption in readdir # v2.6.37+
 - Directory page cache needs to be locked when read # v2.6.37+

New features:
 - Convert NFS to use the new mount API
 - Add "softreval" mount option to let clients use cache if server goes down
 - Add a config option to compile without UDP support
 - Limit the number of inactive delegations the client can cache at once
 - Improved readdir concurrency using iterate_shared()

Other bugfixes and cleanups:
 - More 64-bit time conversions
 - Add additional diagnostic tracepoints
 - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
 - Various xprtrdma cleanups to prepare for 5.7's changes
 - Several fixes for NFS writeback and commit handling
 - Fix acls over krb5i/krb5p mounts
 - Recover from premature loss of openstateids
 - Fix NFS v3 chacl and chmod bug
 - Compare creds using cred_fscmp()
 - Use kmemdup_nul() in more places
 - Optimize readdir cache page invalidation
 - Lease renewal and recovery fixes"

* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
  NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
  NFSv4: try lease recovery on NFS4ERR_EXPIRED
  NFS: Fix memory leaks
  nfs: optimise readdir cache page invalidation
  NFS: Switch readdir to using iterate_shared()
  NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
  NFS: Directory page cache pages need to be locked when read
  NFS: Fix memory leaks and corruption in readdir
  SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
  NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
  NFSv4: Limit the total number of cached delegations
  NFSv4: Add accounting for the number of active delegations held
  NFSv4: Try to return the delegation immediately when marked for return on close
  NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
  NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
  NFS: nfs_find_open_context() should use cred_fscmp()
  NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
  NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
  NFS: remove unused macros
  nfs: Return EINVAL rather than ERANGE for mount parse errors
  ...
2020-02-07  sunrpc: expiry_time should be seconds not timeval  Roberto Bergantinos Corpas
When upcalling gssproxy, cache_head.expiry_time is set as a timeval, not seconds since boot. As such, RPC cache expiry logic will not clean expired objects created under auth.rpcsec.context cache. This has proven to cause kernel memory leaks in the field. Using 64 bit variants of getboottime/timespec. Expiration times have worked this way since 2010's c5b29f885afe "sunrpc: use seconds since boot in expiry cache". The gssproxy code introduced in 2012 added gss_proxy_save_rsc and introduced the bug. That's a while for this to lurk, but it required a bit of an extreme case to make it obvious. Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Cc: stable@vger.kernel.org Fixes: 030d794bf498 "SUNRPC: Use gssproxy upcall for server..." Tested-By: Frank Sorenson <sorenson@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
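Why a wall-clock value never expires when the cache compares against seconds since boot can be shown with a few lines of C. This is a sketch under assumptions: both helper functions are hypothetical, the expiry test is reduced to a single comparison, and the timestamps in the usage note are made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a wall-clock expiry (seconds since the epoch) into the
 * seconds-since-boot time base, given the wall-clock time of boot.
 */
static int64_t model_expiry_since_boot(int64_t wall_expiry, int64_t boot_wall)
{
	assert(wall_expiry >= boot_wall);
	return wall_expiry - boot_wall;
}

/* Simplified expiry test: both arguments are seconds since boot. */
static int model_is_expired(int64_t expiry, int64_t now_since_boot)
{
	return expiry < now_since_boot;
}
```

Take a machine that booted at wall-clock second 1580000000, an entry due to expire at 1580003600, and a check made 7200 seconds after boot. The converted expiry is 3600, so the entry is reported as expired. Stored unconverted, 1580003600 is compared against 7200 and the entry lingers, which matches the leak described above.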
2020-02-04  proc: convert everything to "struct proc_ops"  Alexey Dobriyan
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h.

Conversion rule is:

	llseek => proc_lseek
	unlocked_ioctl => proc_ioctl
	xxx => proc_xxx
	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-03  SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()  Trond Myklebust
Using kmemdup_nul() is more efficient when the length is known. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
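The efficiency argument is that a copy with a known length needs no scan for a terminator. A userspace analogue, offered as a sketch and not as the kernel helper itself: model_memdup_nul() is a hypothetical name, and malloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate exactly @len bytes of @s and append a NUL terminator.
 * No strlen()-style scan is needed because the caller knows @len.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *model_memdup_nul(const char *s, size_t len)
{
	char *p;

	assert(s != NULL);
	p = malloc(len + 1);
	if (!p)
		return NULL;
	memcpy(p, s, len);
	p[len] = '\0';
	return p;
}
```

A caller that has already located the end of a substring, for example the interface name inside a longer address string, can pass that length straight through.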
2020-01-29  Merge tag 'y2038-drivers-for-v5.6-signed' of ↵  Linus Torvalds
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull y2038 updates from Arnd Bergmann: "Core, driver and file system changes These are updates to device drivers and file systems that for some reason or another were not included in the kernel in the previous y2038 series. I've gone through all users of time_t again to make sure the kernel is in a long-term maintainable state, replacing all remaining references to time_t with safe alternatives. Some related parts of the series were picked up into the nfsd, xfs, alsa and v4l2 trees. A final set of patches in linux-mm removes the now unused time_t/timeval/timespec types and helper functions after all five branches are merged for linux-5.6, ensuring that no new users get merged. As a result, linux-5.6, or my backport of the patches to 5.4 [1], should be the first release that can serve as a base for a 32-bit system designed to run beyond year 2038, with a few remaining caveats: - All user space must be compiled with a 64-bit time_t, which will be supported in the coming musl-1.2 and glibc-2.32 releases, along with installed kernel headers from linux-5.6 or higher. - Applications that use the system call interfaces directly need to be ported to use the time64 syscalls added in linux-5.1 in place of the existing system calls. This impacts most users of futex() and seccomp() as well as programming languages that have their own runtime environment not based on libc. - Applications that use a private copy of kernel uapi header files or their contents may need to update to the linux-5.6 version, in particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h, linux/elfcore.h, linux/sockios.h, linux/timex.h and linux/can/bcm.h. - A few remaining interfaces cannot be changed to pass a 64-bit time_t in a compatible way, so they must be configured to use CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit timestamps. Most importantly this impacts all users of 'struct input_event'. 
- All y2038 problems that are present on 64-bit machines also apply to 32-bit machines. In particular this affects file systems with on-disk timestamps using signed 32-bit seconds: ext4 with ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs" [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame * tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits) Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC" y2038: sh: remove timeval/timespec usage from headers y2038: sparc: remove use of struct timex y2038: rename itimerval to __kernel_old_itimerval y2038: remove obsolete jiffies conversion functions nfs: fscache: use timespec64 in inode auxdata nfs: fix timstamp debug prints nfs: use time64_t internally sunrpc: convert to time64_t for expiry drm/etnaviv: avoid deprecated timespec drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC drm/msm: avoid using 'timespec' hfs/hfsplus: use 64-bit inode timestamps hostfs: pass 64-bit timestamps to/from user space packet: clarify timestamp overflow tsacct: add 64-bit btime field acct: stop using get_seconds() um: ubd: use 64-bit time_t where possible xtensa: ISS: avoid struct timeval dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD ...
2020-01-22  sunrpc: clean up cache entry add/remove from hashtable  Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22  sunrpc: Fix potential leaks in sunrpc_cache_unhash()  Trond Myklebust
When we unhash the cache entry, we need to handle any pending upcalls by calling cache_fresh_unlocked(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-15  SUNRPC: Remove broken gss_mech_list_pseudoflavors()  Trond Myklebust
Remove gss_mech_list_pseudoflavors() and its callers. This is part of an unused API, and could leak an RCU reference if it were ever called. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: DMA map rr_rdma_buf as each rpcrdma_rep is created  Chuck Lever
Clean up: This simplifies the logic in rpcrdma_post_recvs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Destroy reps from previous connection instance  Chuck Lever
To safely get rid of all rpcrdma_reps from a particular connection instance, xprtrdma has to wait until each of those reps is finished being used. A rep may be backing the rq_rcv_buf of an RPC that has just completed, for example. Since it is safe to invoke rpcrdma_rep_destroy() only in the Receive completion handler, simply mark reps remaining in the rb_all_reps list after the transport is drained. These will then be deleted as rpcrdma_post_recvs pulls them off the rep free list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Destroy rpcrdma_rep when Receive is flushed  Chuck Lever
This reduces the hardware and memory footprint of an unconnected transport. At some point in the future, transport reconnect will allow resolving the destination IP address through a different device. The current change enables reps for the new connection to be allocated on whichever NUMA node the new device affines to after a reconnect. Note that this does not destroy _all_ the transport's reps... there will be a few that are still part of a running RPC completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Allocate and map transport header buffers at connect time  Chuck Lever
Currently the underlying RDMA device is chosen at transport set-up time. But it will soon be at connect time instead. The maximum size of a transport header is based on device capabilities. Thus transport header buffers have to be allocated _after_ the underlying device has been chosen (via address and route resolution); ie, in the connect worker. Thus, move the allocation of transport header buffers to the connect worker, after the point at which the underlying RDMA device has been chosen. This also means the RDMA device is available to do a DMA mapping of these buffers at connect time, instead of in the hot I/O path. Make that optimization as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Refactor frwr_is_supported  Chuck Lever
Refactor: Perform the "is supported" check in rpcrdma_ep_create() instead of in rpcrdma_ia_open(). frwr_open() is where most of the logic to query device attributes is already located. The current code displays a redundant error message when the device does not support FRWR. As an additional clean-up, this patch removes the extra message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Eliminate per-transport "max pages"  Chuck Lever
To support device hotplug and migrating a connection between devices of different capabilities, we have to guarantee that all in-kernel devices can support the same max NFS payload size (1 megabyte). This means that possibly one or two in-tree devices are no longer supported for NFS/RDMA because they cannot support 1MB rsize/wsize. The only one I confirmed was cxgb3, but it has already been removed from the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15  xprtrdma: Refactor initialization of ep->rep_max_requests  Chuck Lever
Clean up: there is no need to keep two copies of the same value. Also, in subsequent patches, rpcrdma_ep_create() will be called in the connect worker rather than at set-up time. Minor fix: Initialize the transport's sendctx to the value based on the capabilities of the underlying device, not the maximum setting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Make sendctx queue lifetime the same as connection lifetimeChuck Lever
The size of the sendctx queue depends on the value stored in ia->ri_max_send_sges. This value is determined by querying the underlying device. Eventually, rpcrdma_ia_open() and rpcrdma_ep_create() will be called in the connect worker rather than at transport set-up time. The underlying device will not have been chosen at transport set-up time. The sendctx queue will thus have to be created after the underlying device has been chosen via address and route resolution; in other words, in the connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Eliminate ri_max_send_sgesChuck Lever
Clean-up. The max_send_sge value also happens to be stored in ep->rep_attr. Let's keep just a single copy. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15SUNRPC: constify copied structureJulia Lawall
The empty_iov structure is only copied into another structure, so make it const. The opportunity for this change was found using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15SUNRPC: call_connect_status should handle -EPROTOChuck Lever
The xprtrdma connect logic can return -EPROTO if the underlying device or network path does not support RDMA. This can happen after a device removal/insertion. - When SOFTCONN is set, EPROTO is a permanent error. - When SOFTCONN is not set, EPROTO is treated as a temporary error. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
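The two-case policy above can be sketched in plain userspace C. This is an illustration only, not the kernel's call_connect_status(): `classify_connect_error`, `enum connect_verdict`, and the `soft_connect` parameter (standing in for the SOFTCONN task flag) are hypothetical names, and every other error code is deliberately left out of scope.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of a failed connect attempt (not a kernel type). */
enum connect_verdict {
	CONNECT_RETRY,	/* temporary error: attempt the connect again */
	CONNECT_FAIL,	/* permanent error: fail the RPC with this status */
};

/*
 * Sketch of the -EPROTO policy described in the commit message.
 * soft_connect stands in for the SOFTCONN flag on the RPC task.
 * Assumption: all other status values are treated as retryable here,
 * which is a simplification made only for this example.
 */
static enum connect_verdict classify_connect_error(int status, bool soft_connect)
{
	if (status == -EPROTO && soft_connect)
		return CONNECT_FAIL;
	return CONNECT_RETRY;
}
```

With SOFTCONN set, a caller of this sketch gives up on -EPROTO; without it, the same status leads to another connect attempt.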
2020-01-15SUNRPC: Capture signalled RPC tasksChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15sunrpc: convert to time64_t for expiryArnd Bergmann
Using signed 32-bit types for UTC time leads to the y2038 overflow, which is what happens in the sunrpc code at the moment. This changes the sunrpc code over to use time64_t where possible. The one exception is the gss_import_v{1,2}_context() function for kerberos5, which uses 32-bit timestamps in the protocol. Here, we can at least treat the numbers as 'unsigned', which extends the range from 2038 to 2106. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
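The effect of reading a 32-bit protocol timestamp as unsigned can be shown with a small userspace sketch. The helper names and the `time64_sketch_t` typedef are hypothetical; the typedef merely stands in for a signed 64-bit seconds count such as time64_t.

```c
#include <stdint.h>

/* Hypothetical stand-in for a signed 64-bit count of seconds. */
typedef int64_t time64_sketch_t;

/*
 * Reading the 32-bit wire value as signed: values past 2^31 - 1
 * (2038-01-19) become negative. The uint32_t to int32_t conversion is
 * implementation-defined before C23; mainstream compilers wrap.
 */
static time64_sketch_t expiry_from_signed(uint32_t wire)
{
	return (time64_sketch_t)(int32_t)wire;
}

/*
 * Reading the same value as unsigned keeps it positive up to
 * 2^32 - 1 seconds after the epoch (2106-02-07).
 */
static time64_sketch_t expiry_from_unsigned(uint32_t wire)
{
	return (time64_sketch_t)wire;
}
```

For a wire value of 0x80000000, the first helper yields a time before 1970 while the second yields a time in January 2038, one second past the signed limit.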
2020-01-14xprtrdma: Fix oops in Receive handler after device removalChuck Lever
Since v5.4, a device removal occasionally triggered this oops: Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219 Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page Dec 2 17:13:53 manet kernel: PGD 0 P4D 0 Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883 Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma] Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0 Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246 Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048 Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001 Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000 Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428 Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010 Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000 Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0 Dec 2 17:13:53 manet kernel: Call Trace: Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core] Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core] Dec 2 17:13:53 manet 
kernel: process_one_work+0x19d/0x2cd Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9 Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74 Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30 The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that is still pointing to the old ib_device, which has been freed. The only way that is possible is if this rpcrdma_rep was not destroyed by rpcrdma_ia_remove. Debugging showed that was indeed the case: this rpcrdma_rep was still in use by a completing RPC at the time of the device removal, and thus wasn't on the rep free list. So, it was not found by rpcrdma_reps_destroy(). The fix is to introduce a list of all rpcrdma_reps so that they all can be found when a device is removed. That list is used to perform only regbuf DMA unmapping, replacing that call to rpcrdma_reps_destroy(). Meanwhile, to prevent corruption of this list, I've moved the destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs(). rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus protecting the rb_all_reps list. Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
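Why a walk of the free list misses an in-use object, and why a separate list of all objects does not, can be sketched in userspace C. Everything here is hypothetical and simplified: `struct rep`, `struct rep_pool`, and the helpers are illustrative stand-ins, the `dma_mapped` flag stands in for the regbuf DMA mapping, and the sketch ignores the locking and draining that the real code relies on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a receive descriptor. */
struct rep {
	struct rep *next_free;	/* linkage on the free list */
	struct rep *next_all;	/* linkage on the list of every rep */
	bool dma_mapped;	/* stands in for the regbuf DMA mapping */
};

struct rep_pool {
	struct rep *free_list;
	struct rep *all_list;
};

/* A new rep goes on both lists and starts out mapped. */
static void pool_add(struct rep_pool *pool, struct rep *rep)
{
	rep->dma_mapped = true;
	rep->next_all = pool->all_list;
	pool->all_list = rep;
	rep->next_free = pool->free_list;
	pool->free_list = rep;
}

/* Taking a rep for an in-flight RPC removes it from the free list only. */
static struct rep *pool_get(struct rep_pool *pool)
{
	struct rep *rep = pool->free_list;

	if (rep) {
		pool->free_list = rep->next_free;
		rep->next_free = NULL;
	}
	return rep;
}

/* Walking only the free list cannot reach reps that are in use. */
static size_t unmap_free_only(struct rep_pool *pool)
{
	size_t unmapped = 0;

	for (struct rep *rep = pool->free_list; rep; rep = rep->next_free) {
		if (rep->dma_mapped) {
			rep->dma_mapped = false;
			unmapped++;
		}
	}
	return unmapped;
}

/* Walking the list of all reps reaches in-use reps as well. */
static size_t unmap_all(struct rep_pool *pool)
{
	size_t unmapped = 0;

	for (struct rep *rep = pool->all_list; rep; rep = rep->next_all) {
		if (rep->dma_mapped) {
			rep->dma_mapped = false;
			unmapped++;
		}
	}
	return unmapped;
}
```

In this sketch, a rep obtained with pool_get() keeps its mapping after unmap_free_only() and loses it only after unmap_all(), which mirrors the failure and the fix described above.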
2020-01-14xprtrdma: Fix completion wait during device removalChuck Lever
I've found that on occasion, "rmmod <dev>" will hang if an NFS mount is under load. Ensure that ri_remove_done is initialized only just before the transport is woken up to force a close. This avoids the completion possibly getting initialized again while the CM event handler is waiting for a wake-up. Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix create_qp crash on device unloadChuck Lever
On device re-insertion, the RDMA device driver crashes trying to set up a new QP: Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0 Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page Nov 27 16:32:06 manet kernel: PGD 0 P4D 0 Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852 Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12 Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 <f0> 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00 Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046 Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000 Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0 Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000 Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8 Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000 Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000 Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0 Nov 27 16:32:06 manet kernel: Call Trace: Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib] Nov 
27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139 Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib] The fix is to copy the qp_init_attr struct that was just created by rpcrdma_ep_create() instead of using the one from the previous connection instance. Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-12-18nfs: use time64_t internallyArnd Bergmann
The timestamps for the cache are all in boottime seconds, so they don't overflow 32-bit values, but the use of time_t is deprecated because it generally does overflow when used with wall-clock time. There are multiple possible ways of avoiding it: - leave time_t, which is safe here, but forces others to look into this code to determine that it is safe, over and over. - use a more generic type, like 'int' or 'long', which is known to be sufficient here but loses the documentation of referring to timestamps - use ktime_t everywhere, and convert into seconds in the few places where we want realtime-seconds. The conversion is sometimes expensive, but not more so than the conversion we do today. - use time64_t to clarify that this code is safe. Nothing would change for 64-bit architectures, but it is slightly less efficient on 32-bit architectures. Without a clear winner of the three approaches above, this picks the last one, favouring readability over a small performance loss on 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18sunrpc: convert to time64_t for expiryArnd Bergmann
Using signed 32-bit types for UTC time leads to the y2038 overflow, which is what happens in the sunrpc code at the moment. This changes the sunrpc code over to use time64_t where possible. The one exception is the gss_import_v{1,2}_context() function for kerberos5, which uses 32-bit timestamps in the protocol. Here, we can at least treat the numbers as 'unsigned', which extends the range from 2038 to 2106. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-07Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "This is a relatively quiet cycle for nfsd, mainly various bugfixes. Possibly most interesting is Trond's fixes for some callback races that were due to my incomplete understanding of rpc client shutdown. Unfortunately at the last minute I've started noticing a new intermittent failure to send callbacks. As the logic seems basically correct, I'm leaving Trond's patches in for now, and hope to find a fix in the next week so I don't have to revert those patches" * tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits) nfsd: depend on CRYPTO_MD5 for legacy client tracking NFSD fixing possible null pointer derefering in copy offload nfsd: check for EBUSY from vfs_rmdir/vfs_unink. nfsd: Ensure CLONE persists data and metadata changes to the target file SUNRPC: Fix backchannel latency metrics nfsd: restore NFSv3 ACL support nfsd: v4 support requires CRYPTO_SHA256 nfsd: Fix cld_net->cn_tfm initialization lockd: remove __KERNEL__ ifdefs sunrpc: remove __KERNEL__ ifdefs race in exportfs_decode_fh() nfsd: Drop LIST_HEAD where the variable it declares is never used. nfsd: document callback_wq serialization of callback code nfsd: mark cb path down on unknown errors nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback() nfsd: minor 4.1 callback cleanup SUNRPC: Fix svcauth_gss_proxy_init() SUNRPC: Trace gssproxy upcall results sunrpc: fix crash when cache_head become valid before update nfsd: remove private bin2hex implementation ...
2019-12-07Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy of a file from one source server to a different target server). - New RDMA tracepoints for debugging congestion control and Local Invalidate WRs. Bugfixes and cleanups - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for layoutreturn - Handle bad/dead sessions correctly in nfs41_sequence_process() - Various bugfixes to the delegation return operation. - Various bugfixes pertaining to delegations that have been revoked. - Cleanups to the NFS timespec code to avoid unnecessary conversions between timespec and timespec64. - Fix unstable RDMA connections after a reconnect - Close race between waking an RDMA sender and posting a receive - Wake pending RDMA tasks if connection fails - Fix MR list corruption, and clean up MR usage - Fix another RPCSEC_GSS issue with MIC buffer space" * tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) SUNRPC: Capture completion of all RPC tasks SUNRPC: Fix another issue with MIC buffer space NFS4: Trace lock reclaims NFS4: Trace state recovery operation NFSv4.2 fix memory leak in nfs42_ssc_open NFSv4.2 fix kfree in __nfs42_copy_file_range NFS: remove duplicated include from nfs4file.c NFSv4: Make _nfs42_proc_copy_notify() static NFS: Fallocate should use the nfs4_fattr_bitmap NFS: Return -ETXTBSY when attempting to write to a swapfile fs: nfs: sysfs: Remove NULL check before kfree NFS: remove unneeded semicolon NFSv4: add declaration of current_stateid NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process() nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list SUNRPC: Avoid RPC delays when exiting suspend NFS: Add a tracepoint in nfs_fh_to_dentry() NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done() NFSv4: Handle 
NFS4ERR_OLD_STATEID in delegreturn ...
2019-12-04kernel/notifier.c: remove blocking_notifier_chain_cond_register()Xiaoming Ni
blocking_notifier_chain_cond_register() does not consider system_booting state, which is the only difference between this function and blocking_notifier_chain_register(). This can be a bug and is a piece of duplicate code. Delete blocking_notifier_chain_cond_register(). Link: http://lkml.kernel.org/r/1568861888-34045-4-git-send-email-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-22SUNRPC: Capture completion of all RPC tasksChuck Lever
RPC tasks on the backchannel never invoke xprt_complete_rqst(), so there is no way to report their tk_status at completion. Also, any RPC task that exits via rpc_exit_task() before it is replied to will also disappear without a trace. Introduce a trace point that is symmetrical with rpc_task_begin that captures the termination status of each RPC task. Sample trace output for callback requests initiated on the server: kworker/u8:12-448 [003] 127.025240: rpc_task_end: task:50@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task kworker/u8:12-448 [002] 127.567310: rpc_task_end: task:51@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task kworker/u8:12-448 [001] 130.506817: rpc_task_end: task:52@3 flags=ASYNC|DYNAMIC|SOFT|SOFTCONN|SENT runstate=RUNNING|ACTIVE status=0 action=rpc_exit_task Odd, though, that I never see trace_rpc_task_complete, either in the forward or backchannel. Should it be removed? Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-21SUNRPC: Fix backchannel latency metricsChuck Lever
I noticed that for callback requests, the reported backlog latency is always zero, and the rtt value is crazy big. The problem was that rqst->rq_xtime is never set for backchannel requests. Fixes: 78215759e20d ("SUNRPC: Make RTT measurement more ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
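One plausible way an unset transmit timestamp produces a huge RTT is sketched below in userspace C. The structure and function are hypothetical, and the assumption that RTT is computed as "now minus transmit time" on a monotonic nanosecond clock is made for illustration; the commit message itself only states that rq_xtime was never set.

```c
#include <stdint.h>

/* Hypothetical request: timestamps are nanoseconds on a monotonic clock. */
struct req_sketch {
	int64_t xtime_ns;	/* transmit time; stays 0 if never recorded */
};

/*
 * Assumed RTT calculation: reply arrival time minus transmit time.
 * If xtime_ns was never set, the result equals the clock value itself,
 * that is, roughly the time since the clock's origin rather than a
 * round trip.
 */
static int64_t rtt_ns(const struct req_sketch *req, int64_t now_ns)
{
	return now_ns - req->xtime_ns;
}
```

Under this assumption, a reply arriving 127 seconds after the clock origin reports an RTT of 127 seconds when the transmit time is missing, instead of the fraction of a millisecond the exchange actually took.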
2019-11-18SUNRPC: Fix another issue with MIC buffer spaceChuck Lever
xdr_shrink_pagelen() BUG's when @len is larger than buf->page_len. This can happen when xdr_buf_read_mic() is given an xdr_buf with a small page array (like, only a few bytes). Instead, just cap the number of bytes that xdr_shrink_pagelen() will move. Fixes: 5f1bc39979d ("SUNRPC: Fix buffer handling of GSS MIC ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
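The "cap instead of BUG" idea can be sketched in userspace C. This models only the length bookkeeping: `struct buf_sketch` and `shrink_pagelen_capped` are hypothetical names, and the real xdr_shrink_pagelen() also has to move the bytes, which is omitted here.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer: only the page-data length is modelled. */
struct buf_sketch {
	size_t page_len;	/* bytes currently held in the page array */
};

/*
 * Remove up to len bytes from the page data. A request larger than
 * page_len is capped to page_len rather than treated as fatal.
 * Returns the number of bytes actually removed.
 */
static size_t shrink_pagelen_capped(struct buf_sketch *buf, size_t len)
{
	if (len > buf->page_len)
		len = buf->page_len;
	buf->page_len -= len;
	return len;
}
```

A buffer holding only 3 bytes of page data that is asked to give up 8 then gives up 3 and ends with an empty page array, instead of hitting an assertion.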
2019-11-18Merge tag 'nfs-rdma-for-5.5-1' of git://git.linux-nfs.org/projects/anna/linux-nfsTrond Myklebust
NFSoRDMA Client Updates for Linux 5.5 New Features: - New tracepoints for congestion control and Local Invalidate WRs Bugfixes and Cleanups: - Eliminate log noise in call_reserveresult - Fix unstable connections after a reconnect - Clean up some code duplication - Close race between waking a sender and posting a receive - Fix MR list corruption, and clean up MR usage - Remove unused rpcrdma_sendctx fields - Try to avoid DMA mapping pages if it is too costly - Wake pending tasks if connection fails - Replace some dprintk()s with tracepoints
2019-11-06SUNRPC: Avoid RPC delays when exiting suspendTrond Myklebust
Jon Hunter: "I have been tracking down another suspend/NFS related issue where again I am seeing random delays exiting suspend. The delays can be up to a couple minutes in the worst case and this is causing a suspend test we have to fail." Change the use of a deferrable work to a standard delayed one. Reported-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Fixes: 7e0a0e38fcfea ("SUNRPC: Replace the queue timer with a delayed work function") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-03NFSv4.1: Don't rebind to the same source port when reconnecting to the serverTrond Myklebust
NFSv2, v3 and NFSv4 servers often have duplicate replay caches that look at the source port when deciding whether or not an RPC call is a replay of a previous call. This requires clients to perform strange TCP gymnastics in order to ensure that when they reconnect to the server, they bind to the same source port. NFSv4.1 and NFSv4.2 have sessions that provide proper replay semantics, that do not look at the source port of the connection. This patch therefore ensures they can ignore the rebind requirement. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
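Why a port-sensitive replay cache forces the client to reuse its source port can be sketched in userspace C. The key layout and function names are hypothetical and far simpler than any real server's cache; the sketch assumes a legacy cache keyed on the RPC XID plus source port, and a session-style check that ignores the port.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache key for a legacy (pre-sessions) replay cache. */
struct drc_key {
	uint32_t xid;		/* RPC transaction ID */
	uint16_t src_port;	/* client source port of the connection */
};

/* Legacy-style match: a retransmission is recognised only if the port matches too. */
static bool legacy_is_replay(const struct drc_key *cached,
			     const struct drc_key *incoming)
{
	return cached->xid == incoming->xid &&
	       cached->src_port == incoming->src_port;
}

/* Session-style match in this sketch: the source port plays no part. */
static bool portless_is_replay(const struct drc_key *cached,
			       const struct drc_key *incoming)
{
	return cached->xid == incoming->xid;
}
```

In this sketch, a retransmission sent after reconnecting from a different port is missed by the legacy check and would be executed a second time, whereas the port-independent check still recognises it.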
2019-10-30SUNRPC: Fix svcauth_gss_proxy_init()Chuck Lever
gss_read_proxy_verf() assumes things about the XDR buffer containing the RPC Call that are not true for buffers generated by svc_rdma_recv(). RDMA's buffers look more like what the upper layer generates for sending: head is a kmalloc'd buffer; it does not point to a page whose contents are contiguous with the first page in the buffers' page array. The result is that ACCEPT_SEC_CONTEXT via RPC/RDMA has stopped working on Linux NFS servers that use gssproxy. This does not affect clients that use only TCP to send their ACCEPT_SEC_CONTEXT operation (that's all Linux clients). Other clients, like Solaris NFS clients, send ACCEPT_SEC_CONTEXT on the same transport as they send all other NFS operations. Such clients can send ACCEPT_SEC_CONTEXT via RPC/RDMA. I thought I had found every direct reference in the server RPC code to the rqstp->rq_pages field. Bug found at the 2019 Westford NFS bake-a-thon. Fixes: 3316f0631139 ("svcrdma: Persistently allocate and DMA- ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Bill Baker <bill.baker@oracle.com> Reviewed-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>