aboutsummaryrefslogtreecommitdiffstats
path: root/net/tipc/node.c
AgeCommit message (Collapse)Author
2020-06-02Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"Tuong Lien
This reverts commit de058420767df21e2b6b0f3bb36d1616fb962032. There is no actual tipc_node refcnt leak as stated in the above commit. The refcnt is hold carefully for the case of an asynchronous decryption (i.e. -EINPROGRESS/-EBUSY and skb = NULL is returned), so that the node object cannot be freed in the meantime. The counter will be re-balanced when the operation's callback arrives with the decrypted buffer if any. In other cases, e.g. a synchronous crypto the counter will be decreased immediately when it is done. Now with that commit, a kernel panic will occur when there is no node found (i.e. n = NULL) in the 'tipc_rcv()' or a premature release of the node object. This commit solves the issues by reverting the said commit, but keeping one valid case that the 'skb_linearize()' is failed. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Tested-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: add support for broadcast rcv stats dumpingTuong Lien
This commit enables dumping the statistics of a broadcast-receiver link like the traditional 'broadcast-link' one (which is for broadcast- sender). The link dumping can be triggered via netlink (e.g. the iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the indicator. The name of a broadcast-receiver link of a specific peer will be in the format: 'broadcast-link:<peer-id>'. For example: Link <broadcast-link:1001002> Window:50 packets RX packets:7841 fragments:2408/440 bundles:0/0 TX packets:0 fragments:0/0 bundles:0/0 RX naks:0 defs:124 dups:0 TX naks:21 acks:0 retrans:0 Congestion link:0 Send queue max:0 avg:0 In addition, the broadcast-receiver link statistics can be reset in the usual way via netlink by specifying that link name in command. Note: the 'tipc_link_name_ext()' is removed because the link name can now be retrieved simply via the 'l->name'. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: enable broadcast retrans via unicastTuong Lien
In some environment, broadcast traffic is suppressed at high rate (i.e. a kind of bandwidth limit setting). When it is applied, TIPC broadcast can still run successfully. However, when it comes to a high load, some packets will be dropped first and TIPC tries to retransmit them but the packet retransmission is intentionally broadcast too, so making things worse and not helpful at all. This commit enables the broadcast retransmission via unicast which only retransmits packets to the specific peer that has really reported a gap i.e. not broadcasting to all nodes in the cluster, so will prevent from being suppressed, and also reduce some overheads on the other peers due to duplicates, finally improve the overall TIPC broadcast performance. Note: the functionality can be turned on/off via the sysctl file: echo 1 > /proc/sys/net/tipc/bc_retruni echo 0 > /proc/sys/net/tipc/bc_retruni Default is '0', i.e. the broadcast retransmission still works as usual. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26tipc: introduce Gap ACK blocks for broadcast linkTuong Lien
As achieved through commit 9195948fbf34 ("tipc: improve TIPC throughput by Gap ACK blocks"), we apply the same mechanism for the broadcast link as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will consist of two parts built for both the broadcast and unicast types: 31 16 15 0 +-------------+-------------+-------------+-------------+ | bgack_cnt | ugack_cnt | len | +-------------+-------------+-------------+-------------+ - | gap | ack | | +-------------+-------------+-------------+-------------+ > bc gacks : : : | +-------------+-------------+-------------+-------------+ - | gap | ack | | +-------------+-------------+-------------+-------------+ > uc gacks : : : | +-------------+-------------+-------------+-------------+ - which is "automatically" backward-compatible. We also increase the max number of Gap ACK blocks to 128, allowing upto 64 blocks per type (total buffer size = 516 bytes). Besides, the 'tipc_link_advance_transmq()' function is refactored which is applicable for both the unicast and broadcast cases now, so some old functions can be removed and the code is optimized. With the patch, TIPC broadcast is more robust regardless of packet loss or disorder, latency, ... in the underlying network. Its performance is boost up significantly. For example, experiment with a 5% packet loss rate results: $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000 real 0m 42.46s user 0m 1.16s sys 0m 17.67s Without the patch: $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000 real 8m 27.94s user 0m 0.55s sys 0m 2.38s Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18tipc: Fix potential tipc_node refcnt leak in tipc_rcvXiyu Yang
tipc_rcv() invokes tipc_node_find() twice, which returns a reference of the specified tipc_node object to "n" with increased refcnt. When tipc_rcv() returns or a new object is assigned to "n", the original local reference of "n" becomes invalid, so the refcount should be decreased to keep refcount balanced. The issue happens in some paths of tipc_rcv(), which forget to decrease the refcnt increased by tipc_node_find() and will cause a refcnt leak. Fix this issue by calling tipc_node_put() before the original object pointed by "n" becomes invalid. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26tipc: Add a missing case of TIPC_DIRECT_MSG typeHoang Le
In the commit f73b12812a3d ("tipc: improve throughput between nodes in netns"), we're missing a check to handle TIPC_DIRECT_MSG type, it's still using old sending mechanism for this message type. So, throughput improvement is not significant as expected. Besides that, when sending a large message with that type, we're also handle wrong receiving queue, it should be enqueued in socket receiving instead of multicast messages. Fix this by adding the missing case for TIPC_DIRECT_MSG. Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns") Reported-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-10tipc: make three functions staticChen Wandun
Fix the following sparse warning: net/tipc/node.c:281:6: warning: symbol 'tipc_node_free' was not declared. Should it be static? net/tipc/node.c:2801:5: warning: symbol '__tipc_nl_node_set_key' was not declared. Should it be static? net/tipc/node.c:2878:5: warning: symbol '__tipc_nl_node_flush_key' was not declared. Should it be static? Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Fixes: e1f32190cf7d ("tipc: add support for AEAD key setting via netlink") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-10tipc: introduce variable window congestion controlJon Maloy
We introduce a simple variable window congestion control for links. The algorithm is inspired by the Reno algorithm, covering both 'slow start', 'congestion avoidance', and 'fast recovery' modes. - We introduce hard lower and upper window limits per link, still different and configurable per bearer type. - We introduce a 'slow start theshold' variable, initially set to the maximum window size. - We let a link start at the minimum congestion window, i.e. in slow start mode, and then let is grow rapidly (+1 per rceived ACK) until it reaches the slow start threshold and enters congestion avoidance mode. - In congestion avoidance mode we increment the congestion window for each window-size number of acked packets, up to a possible maximum equal to the configured maximum window. - For each non-duplicate NACK received, we drop back to fast recovery mode, by setting the both the slow start threshold to and the congestion window to (current_congestion_window / 2). - If the timeout handler finds that the transmit queue has not moved since the previous timeout, it drops the link back to slow start and forces a probe containing the last sent sequence number to the sent to the peer, so that this can discover the stale situation. This change does in reality have effect only on unicast ethernet transport, as we have seen that there is no room whatsoever for increasing the window max size for the UDP bearer. For now, we also choose to keep the limits for the broadcast link unchanged and equal. This algorithm seems to give a 50-100% throughput improvement for messages larger than MTU. Suggested-by: Xin Long <lucien.xin@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-22tipc: update replicast capability for broadcast send linkHoang Le
When setting up a cluster with non-replicast/replicast capability supported. This capability will be disabled for broadcast send link in order to be backwards compatible. However, when these non-support nodes left and be removed out the cluster. We don't update this capability on broadcast send link. Then, some of features that based on this capability will also disabling as unexpected. In this commit, we make sure the broadcast send link capabilities will be re-calculated as soon as a node removed/rejoined a cluster. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: add support for AEAD key setting via netlinkTuong Lien
This commit adds two netlink commands to TIPC in order for user to be able to set or remove AEAD keys: - TIPC_NL_KEY_SET - TIPC_NL_KEY_FLUSH When the 'KEY_SET' is given along with the key data, the key will be initiated and attached to TIPC crypto. On the other hand, the 'KEY_FLUSH' command will remove all existing keys if any. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: introduce TIPC encryption & authenticationTuong Lien
This commit offers an option to encrypt and authenticate all messaging, including the neighbor discovery messages. The currently most advanced algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All encryption/decryption is done at the bearer layer, just before leaving or after entering TIPC. Supported features: - Encryption & authentication of all TIPC messages (header + data); - Two symmetric-key modes: Cluster and Per-node; - Automatic key switching; - Key-expired revoking (sequence number wrapped); - Lock-free encryption/decryption (RCU); - Asynchronous crypto, Intel AES-NI supported; - Multiple cipher transforms; - Logs & statistics; Two key modes: - Cluster key mode: One single key is used for both TX & RX in all nodes in the cluster. - Per-node key mode: Each nodes in the cluster has one specific TX key. For RX, a node requires its peers' TX key to be able to decrypt the messages from those peers. Key setting from user-space is performed via netlink by a user program (e.g. the iproute2 'tipc' tool). Internal key state machine: Attach Align(RX) +-+ +-+ | V | V +---------+ Attach +---------+ | IDLE |---------------->| PENDING |(user = 0) +---------+ +---------+ A A Switch| A | | | | | | Free(switch/revoked) | | (Free)| +----------------------+ | |Timeout | (TX) | | |(RX) | | | | | | v | +---------+ Switch +---------+ | PASSIVE |<----------------| ACTIVE | +---------+ (RX) +---------+ (user = 1) (user >= 1) The number of TFMs is 10 by default and can be changed via the procfs 'net/tipc/max_tfms'. At this moment, as for simplicity, this file is also used to print the crypto statistics at runtime: echo 0xfff1 > /proc/sys/net/tipc/max_tfms The patch defines a new TIPC version (v7) for the encryption message (- backward compatibility as well). The message is basically encapsulated as follows: +----------------------------------------------------------+ | TIPCv7 encryption | Original TIPCv2 | Authentication | | header | packet (encrypted) | Tag | +----------------------------------------------------------+ The throughput is about ~40% for small messages (compared with non- encryption) and ~9% for large messages. With the support from hardware crypto i.e. the Intel AES-NI CPU instructions, the throughput increases upto ~85% for small messages and ~55% for large messages. By default, the new feature is inactive (i.e. no encryption) until user sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO" in the kernel configuration to enable/disable the new code when needed. MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: enable creating a "preliminary" nodeTuong Lien
When user sets RX key for a peer not existing on the own node, a new node entry is needed to which the RX key will be attached. However, since the peer node address (& capabilities) is unknown at that moment, only the node-ID is provided, this commit allows the creation of a node with only the data that we call as “preliminary”. A preliminary node is not the object of the “tipc_node_find()” but the “tipc_node_find_by_id()”. Once the first message i.e. LINK_CONFIG comes from that peer, and is successfully decrypted by the own node, the actual peer node data will be properly updated and the node will function as usual. In addition, the node timer always starts when a node object is created so if a preliminary node is not used, it will be cleaned up. The later encryption functions will also use the node timer and be able to create a preliminary node automatically when needed. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07tipc: eliminate checking netns if node establishedHoang Le
Currently, we scan over all network namespaces at each received discovery message in order to check if the sending peer might be present in a host local namespaces. This is unnecessary since we can assume that a peer will not change its location during an established session. We now improve the condition for this testing so that we don't perform any redundant scans. Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06tipc: update cluster capabilities if node deletedHoang Le
There are two improvements when re-calculate cluster capabilities: - When deleting a specific down node, need to re-calculate. - In tipc_node_cleanup(), do not need to re-calculate if node is still existing in cluster. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29tipc: improve throughput between nodes in netnsHoang Le
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive patch though the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go though the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been be certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel local node has no other way to join a cluster than any other node, and have to obey to policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace. - If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. - To ensure the "100% certainty" criteria, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces'hash_mix:es. If it finds a match, that, along with a matching node id and cluster id, this is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can justified to allow it to shortcut the lower protocol layers. Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-name space messaging can easily be tracked. v2: - update 'net' pointer when node left/rejoined v3: - grab read/write lock when using node ref obj v4: - clone traffics between netns to loopback Suggested-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-06net: tipc: have genetlink code to parse the attrs during dumpitJiri Pirko
Benefit from the fact that the generic netlink code can parse the attrs for dumpit op and avoid need to parse it in the op callback. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18tipc: clean up skb list lock handling on send pathJon Maloy
The policy for handling the skb list locks on the send and receive paths is simple. - On the send path we never need to grab the lock on the 'xmitq' list when the destination is an exernal node. - On the receive path we always need to grab the lock on the 'inputq' list, irrespective of source node. However, when transmitting node local messages those will eventually end up on the receive path of a local socket, meaning that the argument 'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the function tipc_sk_rcv(). This has been handled by always initializing the spinlock of the 'xmitq' list at message creation, just in case it may end up on the receive path later, and despite knowing that the lock in most cases never will be used. This approach is inaccurate and confusing, and has also concealed the fact that the stated 'no lock grabbing' policy for the send path is violated in some cases. We now clean up this by never initializing the lock at message creation, instead doing this at the moment we find that the message actually will enter the receive path. At the same time we fix the four locations where we incorrectly access the spinlock on the send/error path. This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock is initialised") which has now become redundant. CC: Eric Dumazet <edumazet@google.com> Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08tipc: add loopback device trackingJohn Rutherford
Since node internal messages are passed directly to the socket, it is not possible to observe those messages via tcpdump or wireshark. We now remedy this by making it possible to clone such messages and send the clones to the loopback interface. The clones are dropped at reception and have no functional role except making the traffic visible. The feature is enabled if network taps are active for the loopback device. pcap filtering restrictions require the messages to be presented to the receiving side of the loopback device. v3 - Function dev_nit_active used to check for network taps. - Procedure netif_rx_ni used to send cloned messages to loopback device. Signed-off-by: John Rutherford <john.rutherford@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25tipc: optimize link synching mechanismTuong Lien
This commit along with the next one are to resolve the issues with the link changeover mechanism. See that commit for details. Basically, for the link synching, from now on, we will send only one single ("dummy") SYNCH message to peer. The SYNCH message does not contain any data, just a header conveying the synch point to the peer. A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for backward compatible! Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Suggested-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17tipc: initialize 'validated' field of received packetsJon Maloy
The tipc_msg_validate() function leaves a boolean flag 'validated' in the validated buffer's control block, to avoid performing this action more than once. However, at reception of new packets, the position of this field may already have been set by lower layer protocols, so that the packet is erroneously perceived as already validated by TIPC. We fix this by initializing the said field to 'false' before performing the initial validation. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-25tipc: rename function msg_get_wrapped() to msg_inner_hdr()Jon Maloy
We rename the inline function msg_get_wrapped() to the more comprehensible msg_inner_hdr(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18tipc: fix issues with early FAILOVER_MSG from peerTuong Lien
It appears that a FAILOVER_MSG can come from peer even when the failure link is resetting (i.e. just after the 'node_write_unlock()'...). This means the failover procedure on the node has not been started yet. The situation is as follows: node1 node2 linkb linka linka linkb | | | | | | x failure | | | RESETTING | | | | | | x failure RESET | | RESETTING FAILINGOVER | | | (FAILOVER_MSG) | | |<-------------------------------------------------| | *FAILINGOVER | | | | | (dummy FAILOVER_MSG) | | |------------------------------------------------->| | RESET | | FAILOVER_END | FAILINGOVER RESET | . . . . . . . . . . . . Once this happens, the link failover procedure will be triggered wrongly on the receiving node since the node isn't in FAILINGOVER state but then another link failover will be carried out. The consequences are: 1) A peer might get stuck in FAILINGOVER state because the 'sync_point' was set, reset and set incorrectly, the criteria to end the failover would not be met, it could keep waiting for a message that has already received. 2) The early FAILOVER_MSG(s) could be queued in the link failover deferdq but would be purged or not pulled out because the 'drop_point' was not set correctly. 3) The early FAILOVER_MSG(s) could be dropped too. 4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state shortly, but later on it would be restarted. The same situation can also happen when the link is in PEER_RESET state and a FAILOVER_MSG arrives. The commit resolves the issues by forcing the link down immediately, so the failover procedure will be started normally (which is the same as when receiving a FAILOVER_MSG and the link is in up state). Also, the function "tipc_node_link_failover()" is toughen to avoid such a situation from happening. Acked-by: Jon Maloy <jon.maloy@ericsson.se> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04tipc: fix missing Name entries due to half-failoverTuong Lien
TIPC link can temporarily fall into "half-establish" that only one of the link endpoints is ESTABLISHED and starts to send traffic, PROTOCOL messages, whereas the other link endpoint is not up (e.g. immediately when the endpoint receives ACTIVATE_MSG, the network interface goes down...). This is a normal situation and will be settled because the link endpoint will be eventually brought down after the link tolerance time. However, the situation will become worse when the second link is established before the first link endpoint goes down, For example: 1. Both links <1A-2A>, <1B-2B> down 2. Link endpoint 2A up, but 1A still down (e.g. due to network disturbance, wrong session, etc.) 3. Link <1B-2B> up 4. Link endpoint 2A down (e.g. due to link tolerance timeout) 5. Node B starts failover onto link <1B-2B> ==> Node A does never start link failover. When the "half-failover" situation happens, two consequences have been observed: a) Peer link/node gets stuck in FAILINGOVER state; b) Traffic or user messages that peer node is trying to failover onto the second link can be partially or completely dropped by this node. The consequence a) was actually solved by commit c140eb166d68 ("tipc: fix failover problem"), but that commit didn't cover the b). It's due to the fact that the tunnel link endpoint has never been prepared for a failover, so the 'l->drop_point' (and the other data...) is not set correctly. When a TUNNEL_MSG from peer node arrives on the link, depending on the inner message's seqno and the current 'l->drop_point' value, the message can be dropped (- treated as a duplicate message) or processed. At this early stage, the traffic messages from peer are likely to be NAME_DISTRIBUTORs, this means some name table entries will be missed on the node forever! The commit resolves the issue by starting the FAILOVER process on this node as well. Another benefit from this solution is that we ensure the link will not be re-established until the failover ends. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11tipc: use standard write_lock & unlock functions when creating nodeJon Maloy
In the function tipc_node_create() we protect the peer capability field by using the node rw_lock. However, we access the lock directly instead of using the dedicated functions for this, as we do everywhere else in node.c. This cosmetic spot is fixed here. Fixes: 40999f11ce67 ("tipc: make link capability update thread safe") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-03-23tipc: tipc clang warningJon Maloy
When checking the code with clang -Wsometimes-uninitialized we get the following warning: if (!tipc_link_is_establishing(l)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/tipc/node.c:847:46: note: uninitialized use occurs here tipc_bearer_xmit(n->net, bearer_id, &xmitq, maddr); net/tipc/node.c:831:2: note: remove the 'if' if its condition is always true if (!tipc_link_is_establishing(l)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/tipc/node.c:821:31: note: initialize the variable 'maddr' to silence this warning struct tipc_media_addr *maddr; We fix this by initializing 'maddr' to NULL. For the matter of clarity, we also test if 'xmitq' is non-empty before we use it and 'maddr' further down in the function. It will never happen that 'xmitq' is non- empty at the same time as 'maddr' is NULL, so this is a sufficient test. Fixes: 598411d70f85 ("tipc: make resetting of links non-atomic") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tipc: introduce new capability flag for clusterHoang Le
As a preparation for introducing a smooth switching between replicast and broadcast method for multicast message, We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature. During a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field. The new feature is only applicable if all node in the cluster supports this new capability. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11tipc: fix link session and re-establish issuesTuong Lien
When a link endpoint is re-created (e.g. after a node reboot or interface reset), the link session number is varied by random, the peer endpoint will be synced with this new session number before the link is re-established. However, there is a shortcoming in this mechanism that can lead to the link never re-established or faced with a failure then. It happens when the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as well as the 'in_session' flag have been set, but suddenly this link endpoint leaves. When it comes back with a random session number, there are two situations possible: 1/ If the random session number is larger than (or equal to) the previous one, the peer endpoint will be updated with this new session upon receipt of a RESET_MSG from this endpoint, and the link can be re- established as normal. Otherwise, all the RESET_MSGs from this endpoint will be rejected by the peer. In turn, when this link endpoint receives one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start to send STATE_MSGs, but again these messages will be dropped by the peer due to wrong session. The peer link endpoint can still become ESTABLISHED after receiving a traffic message from this endpoint (e.g. a BCAST_PROTOCOL or NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link will be forced down sooner or later! Even in case the random session number is larger than the previous one, it can be that the ACTIVATE_MSG from the peer arrives first, and this link endpoint moves quickly to ESTABLISHED without sending out any RESET_MSG yet. Consequently, the peer link will not be updated with the new session number, and the same link failure scenario as above will happen. 2/ Another situation can be that, the peer link endpoint was reset due to any reasons in the meantime, its link state was set to RESET from ESTABLISHING but still in session, i.e. the 'in_session' flag is not reset... Now, if the random session number from this endpoint is less than the previous one, all the RESET_MSGs from this endpoint will be rejected by the peer. In the other direction, when this link endpoint receives a RESET_MSG from the peer, it moves to ESTABLISHING and starts to send ACTIVATE_MSGs, but all these messages will be rejected by the peer too. As a result, the link cannot be re-established but gets stuck with this link endpoint in state ESTABLISHING and the peer in RESET! Solution: =========== This link endpoint should not go directly to ESTABLISHED when getting ACTIVATE_MSG from the peer which may belong to the old session if the link was re-created. To ensure the session to be correct before the link is re-established, the peer endpoint in ESTABLISHING state will send back the last session number in ACTIVATE_MSG for a verification at this endpoint. Then, if needed, a new and more appropriate session number will be regenerated to force a re-synch first. In addition, when a link in ESTABLISHING state is reset, its state will move to RESET according to the link FSM, along with resetting the 'in_session' flag (and the other data) as a normal link reset, it will also be deleted if requested. The solution is backward compatible. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19tipc: add trace_events for tipc nodeTuong Lien
The commit adds the new trace_events for TIPC node object: trace_tipc_node_create() trace_tipc_node_delete() trace_tipc_node_lost_contact() trace_tipc_node_timeout() trace_tipc_node_link_up() trace_tipc_node_link_down() trace_tipc_node_reset_links() trace_tipc_node_fsm_evt() trace_tipc_node_check_state() Also, enables the traces for the following cases: - When a node is created/deleted; - When a node contact is lost; - When a node timer is timed out; - When a node link is up/down; - When all node links are reset; - When node state is changed; - When a skb comes and node state needs to be checked/updated. Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19tipc: add trace_events for tipc linkTuong Lien
The commit adds the new trace_events for TIPC link object: trace_tipc_link_timeout() trace_tipc_link_fsm() trace_tipc_link_reset() trace_tipc_link_too_silent() trace_tipc_link_retrans() trace_tipc_link_bc_ack() trace_tipc_link_conges() And the traces for PROTOCOL messages at building and receiving: trace_tipc_proto_build() trace_tipc_proto_rcv() Note: a) The 'tipc_link_too_silent' event will only happen when the 'silent_intv_cnt' is about to reach the 'abort_limit' value (and the event is enabled). The benefit for this kind of event is that we can get an early indication about TIPC link loss issue due to timeout, then can do some necessary actions for troubleshooting. For example: To trigger the 'tipc_proto_rcv' when the 'too_silent' event occurs: echo 'enable_event:tipc:tipc_proto_rcv' > \ events/tipc/tipc_link_too_silent/trigger And disable it when TIPC link is reset: echo 'disable_event:tipc:tipc_proto_rcv' > \ events/tipc/tipc_link_reset/trigger b) The 'tipc_link_retrans' or 'tipc_link_bc_ack' event is useful to trace TIPC retransmission issues. In addition, the commit adds the 'trace_tipc_list/link_dump()' at the 'retransmission failure' case. Then, if the issue occurs, the link 'transmq' along with the link data can be dumped for post-analysis. These dump events should be enabled by default since it will only take effect when the failure happens. The same approach is also applied for the faulty case that the validation of protocol message is failed. Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19tipc: enable tracepoints in tipcTuong Lien
As for the sake of debugging/tracing, the commit enables tracepoints in TIPC along with some general trace_events as shown below. It also defines some 'tipc_*_dump()' functions that allow to dump TIPC object data whenever needed, that is, for general debug purposes, ie. not just for the trace_events. The following trace_events are now available: - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data, e.g. message type, user, droppable, skb truesize, cloned skb, etc. - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or queues, e.g. TIPC link transmq, socket receive queue, etc. - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g. sk state, sk type, connection type, rmem_alloc, socket queues, etc. - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g. link state, silent_intv_cnt, gap, bc_gap, link queues, etc. - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g. node state, active links, capabilities, link entries, etc. How to use: Put the trace functions at any places where we want to dump TIPC data or events. Note: a) The dump functions will generate raw data only, that is, to offload the trace event's processing, it can require a tool or script to parse the data but this should be simple. b) The trace_tipc_*_dump() should be reserved for a failure cases only (e.g. the retransmission failure case) or where we do not expect to happen too often, then we can consider enabling these events by default since they will almost not take any effects under normal conditions, but once the rare condition or failure occurs, we get the dumped data fully for post-analysis. For other trace purposes, we can reuse these trace classes as template but different events. c) A trace_event is only effective when we enable it. To enable the TIPC trace_events, echo 1 to 'enable' files in the events/tipc/ directory in the 'debugfs' file system. Normally, they are located at: /sys/kernel/debug/tracing/events/tipc/ For example: To enable the tipc_link_dump event: echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable To enable all the TIPC trace_events: echo 1 > /sys/kernel/debug/tracing/events/tipc/enable To collect the trace data: cat trace or cat trace_pipe > /trace.out & To disable all the TIPC trace_events: echo 0 > /sys/kernel/debug/tracing/events/tipc/enable To clear the trace buffer: echo > trace d) Like the other trace_events, the feature like 'filter' or 'trigger' is also usable for the tipc trace_events. For more details, have a look at: Documentation/trace/ftrace.txt MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18tipc: handle broadcast NAME_DISTRIBUTOR packet when receiving itZhenbo Gao
NAME_DISTRIBUTOR messages are transmitted through unicast link on TIPC 2.0, by contrast, the messages are delivered through broadcast link on TIPC 1.7. But at present, NAME_DISTRIBUTOR messages received by broadcast link cannot be handled in tipc_rcv() until an unicast message arrives, which may lead to a significant delay to update name table. To avoid this delay, we will also deal with broadcast NAME_DISTRIBUTOR message on broadcast receive path. Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05tipc: fix node keep alive interval calculationHoang Le
When setting LINK tolerance, node timer interval will be calculated base on the LINK with lowest tolerance. But when calculated, the old node timer interval only updated if current setting value (tolerance/4) less than old ones regardless of number of links as well as links' lowest tolerance value. This caused to two cases missing if tolerance changed as following: Case 1: 1.1/ There is one link (L1) available in the system 1.2/ Set L1's tolerance from 1500ms => lower (i.e 500ms) 1.3/ Then, fallback to default (1500ms) or higher (i.e 2000ms) Expected: node timer interval is 1500/4=375ms after 1.3 Result: node timer interval will not being updated after changing tolerance at 1.3 since its value 1500/4=375ms is not less than 500/4=125ms at 1.2. Case 2: 2.1/ There are two links (L1, L2) available in the system 2.2/ L1 and L2 tolerance value are 2000ms as initial 2.3/ Set L2's tolerance from 2000ms => lower 1500ms 2.4/ Disable link L2 (bring down its bearer) Expected: node timer interval is 2000ms/4=500ms after 2.4 Result: node timer interval will not being updated after disabling L2 since its value 2000ms/4=500ms is still not less than 1500/4=375ms at 2.3 although L2 is already not available in the system. To fix this, we start the node interval calculation by initializing it to a value larger than any conceivable calculated value. This way, the link with the lowest tolerance will always determine the calculated value. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27tipc: fix lockdep warning during node deleteJon Maloy
We see the following lockdep warning: [ 2284.078521] ====================================================== [ 2284.078604] WARNING: possible circular locking dependency detected [ 2284.078604] 4.19.0+ #42 Tainted: G E [ 2284.078604] ------------------------------------------------------ [ 2284.078604] rmmod/254 is trying to acquire lock: [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0 [ 2284.078604] [ 2284.078604] but task is already holding lock: [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc] [ 2284.078604] [ 2284.078604] which lock already depends on the new lock. [ 2284.078604] [ 2284.078604] [ 2284.078604] the existing dependency chain (in reverse order) is: [ 2284.078604] [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}: [ 2284.078604] tipc_node_timeout+0x20a/0x330 [tipc] [ 2284.078604] call_timer_fn+0xa1/0x280 [ 2284.078604] run_timer_softirq+0x1f2/0x4d0 [ 2284.078604] __do_softirq+0xfc/0x413 [ 2284.078604] irq_exit+0xb5/0xc0 [ 2284.078604] smp_apic_timer_interrupt+0xac/0x210 [ 2284.078604] apic_timer_interrupt+0xf/0x20 [ 2284.078604] default_idle+0x1c/0x140 [ 2284.078604] do_idle+0x1bc/0x280 [ 2284.078604] cpu_startup_entry+0x19/0x20 [ 2284.078604] start_secondary+0x187/0x1c0 [ 2284.078604] secondary_startup_64+0xa4/0xb0 [ 2284.078604] [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}: [ 2284.078604] del_timer_sync+0x34/0xa0 [ 2284.078604] tipc_node_delete+0x1a/0x40 [tipc] [ 2284.078604] tipc_node_stop+0xcb/0x190 [tipc] [ 2284.078604] tipc_net_stop+0x154/0x170 [tipc] [ 2284.078604] tipc_exit_net+0x16/0x30 [tipc] [ 2284.078604] ops_exit_list.isra.8+0x36/0x70 [ 2284.078604] unregister_pernet_operations+0x87/0xd0 [ 2284.078604] unregister_pernet_subsys+0x1d/0x30 [ 2284.078604] tipc_exit+0x11/0x6f2 [tipc] [ 2284.078604] __x64_sys_delete_module+0x1df/0x240 [ 2284.078604] do_syscall_64+0x66/0x460 [ 2284.078604] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2284.078604] [ 2284.078604] other info that might help us debug this: [ 2284.078604] [ 2284.078604] Possible unsafe locking scenario: [ 2284.078604] [ 2284.078604] CPU0 CPU1 [ 2284.078604] ---- ---- [ 2284.078604] lock(&(&tn->node_list_lock)->rlock); [ 2284.078604] lock((&n->timer)#2); [ 2284.078604] lock(&(&tn->node_list_lock)->rlock); [ 2284.078604] lock((&n->timer)#2); [ 2284.078604] [ 2284.078604] *** DEADLOCK *** [ 2284.078604] [ 2284.078604] 3 locks held by rmmod/254: [ 2284.078604] #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30 [ 2284.078604] #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc] [ 2284.078604] #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19 [...} The reason is that the node timer handler sometimes needs to delete a node which has been disconnected for too long. To do this, it grabs the lock 'node_list_lock', which may at the same time be held by the generic node cleanup function, tipc_node_stop(), during module removal. Since the latter is calling del_timer_sync() inside the same lock, we have a potential deadlock. We fix this letting the timer cleanup function use spin_trylock() instead of just spin_lock(), and when it fails to grab the lock it just returns so that the timer handler can terminate its execution. This is safe to do, since tipc_node_stop() anyway is about to delete both the timer and the node instance. Fixes: 6a939f365bdb ("tipc: Auto removal of peer down node instance") Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tipc: ignore STATE_MSG on wrong link sessionLUU Duc Canh
The initial session number when a link is created is based on a random value, taken from struct tipc_net->random. It is then incremented for each link reset to avoid mixing protocol messages from different link sessions. However, when a bearer is reset all its links are deleted, and will later be re-created using the same random value as the first time. This means that if the link never went down between creation and deletion we will still sometimes have two subsequent sessions with the same session number. In virtual environments with potentially long transmission times this has turned out to be a real problem. We now fix this by randomizing the session number each time a link is created. With a session number size of 16 bits this gives a risk of session collision of 1/64k. To reduce this further, we also introduce a sanity check on the very first STATE message arriving at a link. If this has an acknowledge value differing from 0, which is logically impossible, we ignore the message. The final risk for session collision is hence reduced to 1/4G, which should be sufficient. Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-29tipc: fix failover problemLUU Duc Canh
We see the following scenario: 1) Link endpoint B on node 1 discovers that its peer endpoint is gone. Since there is a second working link, failover procedure is started. 2) Link endpoint A on node 1 sends a FAILOVER message to peer endpoint A on node 2. The node item 1->2 goes to state FAILINGOVER. 3) Linke endpoint A/2 receives the failover, and is supposed to take down its parallell link endpoint B/2, while producing a FAILOVER message to send back to A/1. 4) However, B/2 has already been deleted, so no FAILOVER message can created. 5) Node 1->2 remains in state FAILINGOVER forever, refusing to receive any messages that can bring B/1 up again. We are left with a non- redundant link between node 1 and 2. We fix this with letting endpoint A/2 build a dummy FAILOVER message to send to back to A/1, so that the situation can be resolved. Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20tipc: make link capability update thread safeJon Maloy
The commit referred to below introduced an update of the link capabilities field that is not safe. Given the recently added feature to remove idle node and link items after 5 minutes, there is a small risk that the update will happen at the very moment the targeted link is being removed. To avoid this we have to perform the update inside the node item's write lock protection. Fixes: 9012de508956 ("tipc: add sequence number check for link STATE messages") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: check session number before accepting link protocol messagesJon Maloy
In some virtual environments we observe a significant higher number of packet reordering and delays than we have been used to traditionally. This makes it necessary with stricter checks on incoming link protocol messages' session number, which until now only has been validated for RESET messages. Since the other two message types, ACTIVATE and STATE messages also carry this number, it is easy to extend the validation check to those messages. We also introduce a flag indicating if a link has a valid peer session number or not. This eliminates the mixing of 32- and 16-bit arithmethics we are currently using to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11tipc: add sequence number check for link STATE messagesJon Maloy
Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07tipc: fix wrong return value from function tipc_node_try_addr()Jon Maloy
The function for checking if there is an node address conflict is supposed to return a suggestion for a new address if it finds a conflict, and zero otherwise. But in case the peer being checked is previously unknown it does instead return a "suggestion" for the checked address itself. This results in a DSC_TRIAL_FAIL_MSG being sent unecessarily to the peer, and sometimes makes the trial period starting over again. Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30tipc: Auto removal of peer down node instanceGhantaKrishnamurthy MohanKrishna
A peer node is considered down if there are no active links (or) lost contact to the node. In current implementation, a peer node instance is deleted either if a) TIPC module is removed (or) b) Application can use a netlink/iproute2 interface to delete a specific down node. Thus, a down node instance lives in the system forever, unless the application explicitly removes it. We fix this by deleting the nodes which are down for a specified amount of time (5 minutes). Existing node supervision timer is used to achieve this. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-30tipc: optimize function tipc_node_timeout()Tung Nguyen
In single-link usage, the function tipc_node_timeout() still iterates over the whole link array to handle each link. Given that the maximum number of bearers are 3, there are 2 redundant iterations with lock grab/release. Since this function is executing very frequently it makes sense to optimize it. This commit adds conditional checking to exit from the loop if the known number of configured links has already been accessed. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The bpf syscall and selftests conflicts were trivial overlapping changes. The r8169 change involved moving the added mdelay from 'net' into a different function. A TLS close bug fix overlapped with the splitting of the TLS state into separate TX and RX parts. I just expanded the tests in the bug fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf == X". Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10tipc: eliminate KMSAN uninit-value in strcmp complaintYing Xue
When we get link properties through netlink interface with tipc_nl_node_get_link(), we don't validate TIPC_NLA_LINK_NAME attribute at all, instead we directly use it. As a consequence, KMSAN detected the TIPC_NLA_LINK_NAME attribute was an uninitialized value, and then posted the following complaint: ================================================================== BUG: KMSAN: uninit-value in strcmp+0xf7/0x160 lib/string.c:329 CPU: 1 PID: 4527 Comm: syz-executor655 Not tainted 4.16.0+ #87 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683 strcmp+0xf7/0x160 lib/string.c:329 tipc_nl_node_get_link+0x220/0x6f0 net/tipc/node.c:1881 genl_family_rcv_msg net/netlink/genetlink.c:599 [inline] genl_rcv_msg+0x1686/0x1810 net/netlink/genetlink.c:624 netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2447 genl_rcv+0x63/0x80 net/netlink/genetlink.c:635 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline] netlink_unicast+0x166b/0x1740 net/netlink/af_netlink.c:1337 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046 __sys_sendmsg net/socket.c:2080 [inline] SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091 SyS_sendmsg+0x54/0x80 net/socket.c:2087 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x445589 RSP: 002b:00007fb7ee66cdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 0000000000445589 RDX: 0000000000000000 RSI: 0000000020023000 RDI: 0000000000000003 RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fffa2bf3f3f R14: 00007fb7ee66d9c0 R15: 0000000000000001 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline] netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046 __sys_sendmsg net/socket.c:2080 [inline] SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091 SyS_sendmsg+0x54/0x80 net/socket.c:2087 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 ================================================================== To quiet the complaint, TIPC_NLA_LINK_NAME attribute has been validated in tipc_nl_node_get_link() before it's used. Reported-by: syzbot+df0257c92ffd4fcc58cd@syzkaller.appspotmail.com Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Overlapping changes in selftests Makefile. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27tipc: introduce ioctl for fetching node identityJon Maloy
After the introduction of a 128-bit node identity it may be difficult for a user to correlate between this identity and the generated node hash address. We now try to make this easier by introducing a new ioctl() call for fetching a node identity by using the hash value as key. This will be particularly useful when we extend some of the commands in the 'tipc' tool, but we also expect regular user applications to need this feature. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27tipc: fix bug in function tipc_nl_node_dump_monitorJon Maloy
Commit 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary") intended to fix a problem with user tool looping when max number of bearers are enabled. Unfortunately, the wrong version of the commit was posted, so the problem was not solved at all. This commit adds the missing part. Fixes: 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>