summaryrefslogtreecommitdiffstats
path: root/net/tipc
AgeCommit message (Collapse)Author
2016-06-29tipc: honor msg2addr return valueRichard Alpe
The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which prior to this patch was unhanded in the caller. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27tipc: Use kmemdup instead of kmalloc and memcpyAmitoj Kaur Chawla
Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup. The Coccinelle semantic patch used to make this change is as follows: @@ expression from,to,size,flag; statement S; @@ - to = \(kmalloc\|kzalloc\)(size,flag); + to = kmemdup(from,size,flag); if (to==NULL || ...) S - memcpy(to, from, size); Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22tipc: unclone unbundled buffers before forwardingJon Paul Maloy
When extracting an individual message from a received "bundle" buffer, we just create a clone of the base buffer, and adjust it to point into the right position of the linearized data area of the latter. This works well for regular message reception, but during periods of extremely high load it may happen that an extracted buffer, e.g, a connection probe, is reversed and forwarded through an external interface while the preceding extracted message is still unhandled. When this happens, the header or data area of the preceding message will be partially overwritten by a MAC header, leading to unpredicatable consequences, such as a link reset. We now fix this by ensuring that the msg_reverse() function never returns a cloned buffer, and that the returned buffer always contains sufficient valid head and tail room to be forwarded. Reported-by: Erik Hugne <erik.hugne@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17tipc: fix socket timer deadlockJon Paul Maloy
We sometimes observe a 'deadly embrace' type deadlock occurring between mutually connected sockets on the same node. This happens when the one-hour peer supervision timers happen to expire simultaneously in both sockets. The scenario is as follows: CPU 1: CPU 2: -------- -------- tipc_sk_timeout(sk1) tipc_sk_timeout(sk2) lock(sk1.slock) lock(sk2.slock) msg_create(probe) msg_create(probe) unlock(sk1.slock) unlock(sk2.slock) tipc_node_xmit_skb() tipc_node_xmit_skb() tipc_node_xmit() tipc_node_xmit() tipc_sk_rcv(sk2) tipc_sk_rcv(sk1) lock(sk2.slock) lock((sk1.slock) filter_rcv() filter_rcv() tipc_sk_proto_rcv() tipc_sk_proto_rcv() msg_create(probe_rsp) msg_create(probe_rsp) tipc_sk_respond() tipc_sk_respond() tipc_node_xmit_skb() tipc_node_xmit_skb() tipc_node_xmit() tipc_node_xmit() tipc_sk_rcv(sk1) tipc_sk_rcv(sk2) lock((sk1.slock) lock((sk2.slock) ===> DEADLOCK ===> DEADLOCK Further analysis reveals that there are three different locations in the socket code where tipc_sk_respond() is called within the context of the socket lock, with ensuing risk of similar deadlocks. We now solve this by passing a buffer queue along with all upcalls where sk_lock.slock may potentially be held. Response or rejected message buffers are accumulated into this queue instead of being sent out directly, and only sent once we know we are safely outside the slock context. Reported-by: GUNA <gbalasun@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17tipc: potential shift wrapping bug in map_set()Dan Carpenter
"up_map" is a u64 type but we're not using the high 32 bits. Fixes: 35c55c9877f8 ('tipc: add neighbor monitoring framework') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15tipc: eliminate uninitialized variable warningYing Xue
net/tipc/link.c: In function ‘tipc_link_timeout’: net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized] Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15tipc: fix suspicious RCU usageYing Xue
When run tipcTS&tipcTC test suite, the following complaint appears: [ 56.926168] =============================== [ 56.926169] [ INFO: suspicious RCU usage. ] [ 56.926171] 4.7.0-rc1+ #160 Not tainted [ 56.926173] ------------------------------- [ 56.926174] net/tipc/bearer.c:408 suspicious rcu_dereference_protected() usage! [ 56.926175] [ 56.926175] other info that might help us debug this: [ 56.926175] [ 56.926177] [ 56.926177] rcu_scheduler_active = 1, debug_locks = 1 [ 56.926179] 3 locks held by swapper/4/0: [ 56.926180] #0: (((&req->timer))){+.-...}, at: [<ffffffff810e79b5>] call_timer_fn+0x5/0x340 [ 56.926203] #1: (&(&req->lock)->rlock){+.-...}, at: [<ffffffffa000c29b>] disc_timeout+0x1b/0xd0 [tipc] [ 56.926212] #2: (rcu_read_lock){......}, at: [<ffffffffa00055e0>] tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc] [ 56.926218] [ 56.926218] stack backtrace: [ 56.926221] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.7.0-rc1+ #160 [ 56.926222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 56.926224] 0000000000000000 ffff880016803d28 ffffffff813c4423 ffff8800154252c0 [ 56.926227] 0000000000000001 ffff880016803d58 ffffffff810b7512 ffff8800124d8120 [ 56.926230] ffff880013f8a160 ffff8800132b5ccc ffff8800124d8120 ffff880016803d88 [ 56.926234] Call Trace: [ 56.926235] <IRQ> [<ffffffff813c4423>] dump_stack+0x67/0x94 [ 56.926250] [<ffffffff810b7512>] lockdep_rcu_suspicious+0xe2/0x120 [ 56.926256] [<ffffffffa00051f1>] tipc_l2_send_msg+0x131/0x1c0 [tipc] [ 56.926261] [<ffffffffa000567c>] tipc_bearer_xmit_skb+0x14c/0x2e0 [tipc] [ 56.926266] [<ffffffffa00055e0>] ? tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc] [ 56.926273] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc] [ 56.926278] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc] [ 56.926283] [<ffffffffa000c2d6>] disc_timeout+0x56/0xd0 [tipc] [ 56.926288] [<ffffffff810e7a68>] call_timer_fn+0xb8/0x340 [ 56.926291] [<ffffffff810e79b5>] ? call_timer_fn+0x5/0x340 [ 56.926296] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc] [ 56.926300] [<ffffffff810e8f4a>] run_timer_softirq+0x23a/0x390 [ 56.926306] [<ffffffff810f89ff>] ? clockevents_program_event+0x7f/0x130 [ 56.926316] [<ffffffff819727c3>] __do_softirq+0xc3/0x4a2 [ 56.926323] [<ffffffff8106ba5a>] irq_exit+0x8a/0xb0 [ 56.926327] [<ffffffff81972456>] smp_apic_timer_interrupt+0x46/0x60 [ 56.926331] [<ffffffff81970a49>] apic_timer_interrupt+0x89/0x90 [ 56.926333] <EOI> [<ffffffff81027fda>] ? default_idle+0x2a/0x1a0 [ 56.926340] [<ffffffff81027fd8>] ? default_idle+0x28/0x1a0 [ 56.926342] [<ffffffff810289cf>] arch_cpu_idle+0xf/0x20 [ 56.926345] [<ffffffff810adf0f>] default_idle_call+0x2f/0x50 [ 56.926347] [<ffffffff810ae145>] cpu_startup_entry+0x215/0x3e0 [ 56.926353] [<ffffffff81040ad9>] start_secondary+0xf9/0x100 The warning appears as rtnl_dereference() is wrongly used in tipc_l2_send_msg() under RCU read lock protection. Instead the proper usage should be that rcu_dereference_rtnl() is called here. Fixes: 5b7066c3dd24 ("tipc: stricter filtering of packets in bearer layer") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15tipc: add neighbor monitoring frameworkJon Paul Maloy
TIPC based clusters are by default set up with full-mesh link connectivity between all nodes. Those links are expected to provide a short failure detection time, by default set to 1500 ms. Because of this, the background load for neighbor monitoring in an N-node cluster increases with a factor N on each node, while the overall monitoring traffic through the network infrastructure increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters don't scale well beyond ~100 nodes unless we significantly increase failure discovery tolerance. This commit introduces a framework and an algorithm that drastically reduces this background load, while basically maintaining the original failure detection times across the whole cluster. Using this algorithm, background load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will now have to actively monitor 38 neighbors in a 400-node cluster, instead of as before 399. This "Overlapping Ring Supervision Algorithm" is completely distributed and employs no centralized or coordinated state. It goes as follows: - Each node makes up a linearly ascending, circular list of all its N known neighbors, based on their TIPC node identity. This algorithm must be the same on all nodes. - The node then selects the next M = sqrt(N) - 1 nodes downstream from itself in the list, and chooses to actively monitor those. This is called its "local monitoring domain". - It creates a domain record describing the monitoring domain, and piggy-backs this in the data area of all neighbor monitoring messages (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in the cluster eventually (default within 400 ms) will learn about its monitoring domain. - Whenever a node discovers a change in its local domain, e.g., a node has been added or has gone down, it creates and sends out a new version of its node record to inform all neighbors about the change. - A node receiving a domain record from anybody outside its local domain matches this against its own list (which may not look the same), and chooses to not actively monitor those members of the received domain record that are also present in its own list. Instead, it relies on indications from the direct monitoring nodes if an indirectly monitored node has gone up or down. If a node is indicated lost, the receiving node temporarily activates its own direct monitoring towards that node in order to confirm, or not, that it is actually gone. - Since each node is actively monitoring sqrt(N) downstream neighbors, each node is also actively monitored by the same number of upstream neighbors. This means that all non-direct monitoring nodes normally will receive sqrt(N) indications that a node is gone. - A major drawback with ring monitoring is how it handles failures that cause massive network partitionings. If both a lost node and all its direct monitoring neighbors are inside the lost partition, the nodes in the remaining partition will never receive indications about the loss. To overcome this, each node also chooses to actively monitor some nodes outside its local domain. Those nodes are called remote domain "heads", and are selected in such a way that no node in the cluster will be more than two direct monitoring hops away. Because of this, each node, apart from monitoring the member of its local domain, will also typically monitor sqrt(N) remote head nodes. - As an optimization, local list status, domain status and domain records are marked with a generation number. This saves senders from unnecessarily conveying unaltered domain records, and receivers from performing unneeded re-adaptations of their node monitoring list, such as re-assigning domain heads. - As a measure of caution we have added the possibility to disable the new algorithm through configuration. We do this by keeping a threshold value for the cluster size; a cluster that grows beyond this value will switch from full-mesh to ring monitoring, and vice versa when it shrinks below the value. This means that if the threshold is set to a value larger than any anticipated cluster size (default size is 32) the new algorithm is effectively disabled. A patch set for altering the threshold value and for listing the table contents will follow shortly. - This change is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/sched/act_police.c net/sched/sch_drr.c net/sched/sch_hfsc.c net/sched/sch_prio.c net/sched/sch_red.c net/sched/sch_tbf.c In net-next the drop methods of the packet schedulers got removed, so the bug fixes to them in 'net' are irrelevant. A packet action unload crash fix conflicts with the addition of the new firstuse timestamp. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08tipc: change node timer unit from jiffies to msJon Paul Maloy
The node keepalive interval is recalculated at each timer expiration to catch any changes in the link tolerance, and stored in a field in struct tipc_node. We use jiffies as unit for the stored value. This is suboptimal, because it makes the calculation unnecessary complex, including two unit conversions. The conversions also lead to a rounding error that causes the link "abort limit" to be 3 in the normal case, instead of 4, as intended. This again leads to unnecessary link resets when the network is pushed close to its limit, e.g., in an environment with hundreds of nodes or namesapces. In this commit, we do instead let the keepalive value be calculated and stored in milliseconds, so that there is only one conversion and the rounding error is eliminated. We also remove a redundant "keepalive" field in struct tipc_link. This is remnant from the previous implementation. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08tipc: correct error in node fsmJon Paul Maloy
commit 88e8ac7000dc ("tipc: reduce transmission rate of reset messages when link is down") revealed a flaw in the node FSM, as defined in the log of commit 66996b6c47ed ("tipc: extend node FSM"). We see the following scenario: 1: Node B receives a RESET message from node A before its link endpoint is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This event will not change the node FSM state, but the (distinct) link FSM will move to state RESETTING. 2: As an effect of the previous event, the local endpoint on B will declare node A lost, and post the event SELF_DOWN to the its node FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning that no messages will be accepted from A until it receives another RESET message that confirms that A's endpoint has been reset. This is wasteful, since we know this as a fact already from the first received RESET, but worse is that the link instance's FSM has not wasted this information, but instead moved on to state ESTABLISHING, meaning that it repeatedly sends out ACTIVATE messages to the reset peer A. 3: Node A will receive one of the ACTIVATE messages, move its link FSM to state ESTABLISHED, and start repeatedly sending out STATE messages to node B. 4: Node B will consistently drop these messages, since it can only accept accept a RESET according to its node FSM. 5: After four lost STATE messages node A will reset its link and start repeatedly sending out RESET messages to B. 6: Because of the reduced send rate for RESET messages, it is very likely that A will receive an ACTIVATE (which is sent out at a much higher frequency) before it gets the chance to send a RESET, and A may hence quickly move back to state ESTABLISHED and continue sending out STATE messages, which will again be dropped by B. 7: GOTO 5. 8: After having repeated the cycle 5-7 a number of times, node A will by chance get in between with sending a RESET, and the situation is resolved. Unfortunately, we have seen that it may take a substantial amount of time before this vicious loop is broken, sometimes in the order of minutes. We correct this by making a small correction to the node FSM: When a node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't need to wait for RESET confirmation from of an endpoint that we alread know has been reset. It also means that node B in the scenario above will not be dropping incoming STATE messages, and the link can come up immediately. Finally, a symmetry comparison reveals that the FSM has a similar error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING. Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect of this logical error, we choose fix this one, too. The node FSM looks as follows after those changes: +----------------------------------------+ | PEER_DOWN_EVT| | | +------------------------+----------------+ | |SELF_DOWN_EVT | | | | | | | | +-----------+ +-----------+ | | |NODE_ | |NODE_ | | | +----------|FAILINGOVER|<---------|SYNCHING |-----------+ | | |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | | | |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| | | | | | | | | | | | | | | | | | | | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | | | | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | | | | | | | | | | | | | | | | | | | | | +--------------+ | | | | | +-------->| SELF_UP_ |<-------+ | | | | +-----------------| PEER_UP |----------------+ | | | | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | | | | | A A | | | | | | | | | | | | | | PEER_UP_EVT| |SELF_UP_EVT | | | | | | | | | | | V V V | | V V V +------------+ +-----------+ +-----------+ +------------+ |SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN | |PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING| +------------+ +-----------+ +-----------+ +------------+ | | A A | | | | | | | | | SELF_ | |SELF_ |PEER_ |PEER_ | | DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT | | | | | | | | | | | | | | | +--------------+ | | |PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT| +------------------->| PEER_DOWN |<--------------------+ +--------------+ Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02tipc: fix an infoleak in tipc_nl_compat_link_dumpKangjie Lu
link_info.str is a char array of size 60. Memory after the NULL byte is not initialized. Sending the whole object out can cause a leak. Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-25tipc: fix potential null pointer dereferences in some compat functionsBaozeng Ding
Before calling the nla_parse_nested function, make sure the pointer to the attribute is not null. This patch fixes several potential null pointer dereference vulnerabilities in the tipc netlink functions. Signed-off-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-19tipc: block BH in TCP callbacksEric Dumazet
TCP stack can now run from process context. Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous assumption. Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog") Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits) gitignore: fix wording mfd: ab8500-debugfs: fix "between" in printk memstick: trivial fix of spelling mistake on management cpupowerutils: bench: fix "average" treewide: Fix typos in printk IB/mlx4: printk fix pinctrl: sirf/atlas7: fix printk spelling serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/ w1: comment spelling s/minmum/minimum/ Blackfin: comment spelling s/divsor/divisor/ metag: Fix misspellings in comments. ia64: Fix misspellings in comments. hexagon: Fix misspellings in comments. tools/perf: Fix misspellings in comments. cris: Fix misspellings in comments. c6x: Fix misspellings in comments. blackfin: Fix misspelling of 'register' in comment. avr32: Fix misspelling of 'definitions' in comment. treewide: Fix typos in printk Doc: treewide : Fix typos in DocBook/filesystem.xml ...
2016-05-17tipc: fix nametable publication field in nl compatRichard Alpe
The publication field of the old netlink API should contain the publication key and not the publication reference. Fixes: 44a8ae94fd55 (tipc: convert legacy nl name table dump to nl compat) Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16tipc: check nl sock before parsing nested attributesRichard Alpe
Make sure the socket for which the user is listing publication exists before parsing the socket netlink attributes. Prior to this patch a call without any socket caused a NULL pointer dereference in tipc_nl_publ_dump(). Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.cm> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12tipc: eliminate risk of double link_up eventsJon Paul Maloy
When an ACTIVATE or data packet is received in a link in state ESTABLISHING, the link does not immediately change state to ESTABLISHED, but does instead return a LINK_UP event to the caller, which will execute the state change in a different lock context. This non-atomic approach incurs a low risk that we may have two LINK_UP events pending simultaneously for the same link, resulting in the final part of the setup procedure being executed twice. The only potential harm caused by this it that we may see two LINK_UP events issued to subsribers of the topology server, something that may cause confusion. This commit eliminates this risk by checking if the link is already up before proceeding with the second half of the setup. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/ipv4/ip_gre.c Minor conflicts between tunnel bug fixes in net and ipv6 tunnel cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: redesign connection-level flow controlJon Paul Maloy
There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: propagate peer node capabilities to socket layerJon Paul Maloy
During neighbor discovery, nodes advertise their capabilities as a bit map in a dedicated 16-bit field in the discovery message header. This bit map has so far only be stored in the node structure on the peer nodes, but we now see the need to keep a copy even in the socket structure. This commit adds this functionality. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: re-enable compensation for socket receive buffer double countingJon Paul Maloy
In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers in socket in separate function") we did by accident replace the test if (sk->sk_backlog.len == 0) atomic_set(&tsk->dupl_rcvcnt, 0); with if (sk->sk_backlog.len) atomic_set(&tsk->dupl_rcvcnt, 0); This effectively disables the compensation we have for the double receive buffer accounting that occurs temporarily when buffers are moved from the backlog to the socket receive queue. Until now, this has gone unnoticed because of the large receive buffer limits we are applying, but becomes indispensable when we reduce this buffer limit later in this series. We now fix this by inverting the mentioned condition. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01tipc: only process unicast on intended nodeHamish Martin
We have observed complete lock up of broadcast-link transmission due to unacknowledged packets never being removed from the 'transmq' queue. This is traced to nodes having their ack field set beyond the sequence number of packets that have actually been transmitted to them. Consider an example where node 1 has sent 10 packets to node 2 on a link and node 3 has sent 20 packets to node 2 on another link. We see examples of an ack from node 2 destined for node 3 being treated as an ack from node 2 at node 1. This leads to the ack on the node 1 to node 2 link being increased to 20 even though we have only sent 10 packets. When node 1 does get around to sending further packets, none of the packets with sequence numbers less than 21 are actually removed from the transmq. To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc: reduce locking scope during packet reception") which ensures that only messages destined for the receiving node are processed by that node. This prevents the sequence numbers from getting out of sync and resolves the packet leakage, thereby resolving the broadcast-link transmission lock-ups we observed. While we are aware that this change only patches over a root problem that we still haven't identified, this is a sanity test that it is always legitimate to do. It will remain in the code even after we identify and fix the real problem. Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01tipc: set 'active' state correctly for first established linkJon Paul Maloy
When we are displaying statistics for the first link established between two peers, it will always be presented as STANDBY although it in reality is ACTIVE. This happens because we forget to set the 'active' flag in the link instance at the moment it is established. Although this is a bug, it only has impact on the presentation view of the link, not on its actual functionality. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28tipc: remove an unnecessary NULL checkDan Carpenter
This is never called with a NULL "buf" and anyway, we dereference 's' on the lines before so it would Oops before we reach the check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tipc: fix stale links after re-enabling bearerParthasarathy Bhuvaragan
Commit 42b18f605fea ("tipc: refactor function tipc_link_timeout()"), introduced a bug which prevents sending of probe messages during link synchronization phase. This leads to hanging links, if the bearer is disabled/enabled after links are up. In this commit, we send the probe messages correctly. Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18treewide: Fix typos in printkMasanari Iida
This patch fix spelling typos found in printk within various part of the kernel sources. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-15tipc: let first message on link be a state messageJon Paul Maloy
According to the link FSM, a received traffic packet can take a link from state ESTABLISHING to ESTABLISHED, but the link can still not be fully set up in one atomic operation. This means that even if the the very first packet on the link is a traffic packet with sequence number 1 (one), it has to be dropped and retransmitted. This can be avoided if we let the mentioned packet be preceded by a LINK_PROTOCOL/STATE message, which takes up the endpoint before the arrival of the traffic. We add this small feature in this commit. This is a fully compatible change. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: ensure that first packets on link are sent in orderJon Paul Maloy
In some link establishment scenarios we see that packet #2 may be sent out before packet #1, forcing the receiver to demand retransmission of the missing packet. This is harmless, but may cause confusion among people tracing the packet flow. Since this is extremely easy to fix, we do so by adding en extra send call to the bearer immediately after the link has come up. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: refactor function tipc_link_timeout()Jon Paul Maloy
The function tipc_link_timeout() is unnecessary complex, and can easily be made more readable. We do that with this commit. The only functional change is that we remove a redundant test for whether the broadcast link is up or not. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: reduce transmission rate of reset messages when link is downJon Paul Maloy
When a link is down, it will continuously try to re-establish contact with the peer by sending out a RESET or an ACTIVATE message at each timeout interval. The default value for this interval is currently 375 ms. This is wasteful, and may become a problem in very large clusters with dozens or hundreds of nodes being down simultaneously. We now introduce a simple backoff algorithm for these cases. The first five messages are sent at default rate; thereafter a message is sent only each 16th timer interval. This will cover the vast majority of link recycling cases, since the endpoint starting last will transmit at the higher speed, and the link should normally be established well be before the rate needs to be reduced. The only case where we will see a degradation of link re-establishment times is when the endpoints remain intact, and a glitch in the transmission media is causing the link reset. We will then experience a worst-case re-establishing time of 6 seconds, something we deem acceptable. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tipc: guarantee peer bearer id exchange after rebootJon Paul Maloy
When a link endpoint is going down locally, e.g., because its interface is being stopped, it will spontaneously send out a RESET message to its peer, informing it about this fact. This saves the peer from detecting the failure via probing, and hence gives both speedier and less resource consuming failure detection on the peer side. According to the link FSM, a receiver of a RESET message, ignoring the reason for it, must now consider the sender ready to come back up, and starts periodically sending out ACTIVATE messages to the peer in order to re-establish the link. Also, according to the FSM, the receiver of an ACTIVATE message can now go directly to state ESTABLISHED and start sending regular traffic packets. This is a well-proven and robust FSM. However, in the case of a reboot, there is a small possibilty that link endpoint on the rebooted node may have been re-created with a new bearer identity between the moment it sent its (pre-boot) RESET and the moment it receives the ACTIVATE from the peer. The new bearer identity cannot be known by the peer according to this scenario, since traffic headers don't convey such information. This is a problem, because both endpoints need to know the correct value of the peer's bearer id at any moment in time in order to be able to produce correct link events for their users. The only way to guarantee this is to enforce a full setup message exchange (RESET + ACTIVATE) even after the reboot, since those messages carry the bearer idientity in their header. In this commit we do this by introducing and setting a "stopping" bit in the header of the spontaneously generated RESET messages, informing the peer that the sender will not be immediately ready to re-establish the link. A receiver seeing this bit must act as if this were a locally detected connectivity failure, and hence has to go through a full two- way setup message exchange before any link can be re-established. Although never reported, this problem seems to have always been around. This protocol addition is fully backwards compatible. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14tipc: fix a race condition leading to subscriber refcnt bugParthasarathy Bhuvaragan
Until now, the requests sent to topology server are queued to a workqueue by the generic server framework. These messages are processed by worker threads and trigger the registered callbacks. To reduce latency on uniprocessor systems, explicit rescheduling is performed using cond_resched() after MAX_RECV_MSG_COUNT(25) messages. This implementation on SMP systems leads to an subscriber refcnt error as described below: When a worker thread yields by calling cond_resched() in a SMP system, a new worker is created on another CPU to process the pending workitem. Sometimes the sleeping thread wakes up before the new thread finishes execution. This breaks the assumption on ordering and being single threaded. The fault is more frequent when MAX_RECV_MSG_COUNT is lowered. If the first thread was processing subscription create and the second thread processing close(), the close request will free the subscriber and the create request oops as follows: [31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97 [31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc] [...] [31.228377] Call Trace: [31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72 [31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0 [31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20 [31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc] [31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc] [31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc] [31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0 [31.246554] ---[ end trace c3882c9baa05a4fd ]--- [31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266 [31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428 [31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0 [31.249323] PGD 0 [31.249323] Oops: 0000 [#1] SMP In this commit, we - rename tipc_conn_shutdown() to tipc_conn_release(). - move connection release callback execution from tipc_close_conn() to a new function tipc_sock_release(), which is executed before we free the connection. Thus we release the subscriber during connection release procedure rather than connection shutdown procedure. Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13tipc: remove remnants of old broadcast codeJon Paul Maloy
We remove a couple of leftover fields in struct tipc_bearer. Those were used by the old broadcast implementation, and are not needed any longer. There is no functional changes in this commit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11tipc: purge deferred updates from dead nodesErik Hugne
If a peer node becomes unavailable, in addition to removing the nametable entries from this node we also need to purge all deferred updates associated with this node. Signed-off-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11tipc: make dist queue pernetErik Hugne
Nametable updates received from the network that cannot be applied immediately are placed on a defer queue. This queue is global to the TIPC module, which might cause problems when using TIPC in containers. To prevent nametable updates from escaping into the wrong namespace, we make the queue pernet instead. Signed-off-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07tipc: stricter filtering of packets in bearer layerJon Paul Maloy
Resetting a bearer/interface, with the consequence of resetting all its pertaining links, is not an atomic action. This becomes particularly evident in very large clusters, where a lot of traffic may happen on the remaining links while we are busy shutting them down. In extreme cases, we may even see links being re-created and re-established before we are finished with the job. To solve this, we now introduce a solution where we temporarily detach the bearer from the interface when the bearer is reset. This inhibits all packet reception, while sending still is possible. For the latter, we use the fact that the device's user pointer now is zero to filter out which packets can be sent during this situation; i.e., outgoing RESET messages only. This filtering serves to speed up the neighbors' detection of the loss event, and saves us from unnecessary probing. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07tipc: eliminate buffer leak in bearer layerJon Paul Maloy
When enabling a bearer we create a 'neigbor discoverer' instance by calling the function tipc_disc_create() before the bearer is actually registered in the list of enabled bearers. Because of this, the very first discovery broadcast message, created by the mentioned function, is lost, since it cannot find any valid bearer to use. Furthermore, the used send function, tipc_bearer_xmit_skb() does not free the given buffer when it cannot find a bearer, resulting in the leak of exactly one send buffer each time a bearer is enabled. This commit fixes this problem by introducing two changes: 1) Instead of attemting to send the discovery message directly, we let tipc_disc_create() return the discovery buffer to the calling function, tipc_enable_bearer(), so that the latter can send it when the enabling sequence is finished. 2) In tipc_bearer_xmit_skb(), as well as in the two other transmit functions at the bearer layer, we now free the indicated buffer or buffer chain when a valid bearer cannot be found. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14tipc: make sure IPv6 header fits in skb headroomRichard Alpe
Expand headroom further in order to be able to fit the larger IPv6 header. Prior to this patch this caused a skb under panic for certain tipc packets when using IPv6 UDP bearer(s). Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11ip_tunnel: add support for setting flow label via collect metadataDaniel Borkmann
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label from call sites. Currently, there's no such option and it's always set to zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so that flow-based tunnels via collect metadata frontends can make use of it. vxlan and geneve will be converted to add flow label support separately. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of overlapping changes, as well as one instance (vxlan) of a bug fix in 'net' overlapping with code movement in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07tipc: move netlink policies to netlink.cRichard Alpe
Make the c files less cluttered and enable netlink attributes to be shared between files. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: remove pre-allocated message header in link structJon Paul Maloy
Until now, we have kept a pre-allocated protocol message header aggregated into struct tipc_link. Apart from adding unnecessary footprint to the link instances, this requires extra code both to initialize and re-initialize it. We now remove this sub-optimization. This change also makes it possible to clean up the function tipc_build_proto_msg() and remove a couple of small functions that were accessing the mentioned header. In particular, we can replace all occurrences of the local function call link_own_addr(link) with the generic tipc_own_addr(net). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: fix nullptr crash during subscription cancelParthasarathy Bhuvaragan
commit 4d5cfcba2f6e ('tipc: fix connection abort during subscription cancel'), removes the check for a valid subscription before calling tipc_nametbl_subscribe(). This will lead to a nullptr exception when we process a subscription cancel request. For a cancel request, a null subscription is passed to tipc_nametbl_subscribe() resulting in exception. In this commit, we call tipc_nametbl_subscribe() only for a valid subscription. Fixes: 4d5cfcba2f6e ('tipc: fix connection abort during subscription cancel') Reported-by: Anders Widell <anders.widell@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: make sure required IPv6 addresses are scopedRichard Alpe
Make sure the user has provided a scope for multicast and link local addresses used locally by a UDP bearer. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: safely copy UDP netlink data from userRichard Alpe
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE is of type binary with a defined length. This causes the policy framework to threat the defined length as maximum length. There is however no protection against a user sending a smaller amount of data. Prior to this patch this wasn't handled which could result in a partially incomplete sockaddr_storage struct containing uninitialized data. In this patch we use nla_memcpy() when copying the user data. This ensures a potential gap at the end is cleared out properly. This was found by Julia with Coccinelle tool. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: don't check link reset on non existing linkRichard Alpe
Make sure we have a link before checking if it has been reset or not. Prior to this patch tipc_link_is_reset() could be called with a non existing link, resulting in a null pointer dereference. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06tipc: add net device to skb before UDP xmitRichard Alpe
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer dereference in iptunnel_xmit_stats(), when it tried to dereference the net device from the skb. To resolve this we now point the skb device to the net device resolved from the routing table. Fixes: 039f50629b7f (ip_tunnel: Move stats update to iptunnel_xmit()) Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain"Parthasarathy Bhuvaragan
reverts commit 94153e36e709e ("tipc: use existing sk_write_queue for outgoing packet chain") In Commit 94153e36e709e, we assume that we fill & empty the socket's sk_write_queue within the same lock_sock() session. This is not true if the link is congested. During congestion, the socket lock is released while we wait for the congestion to cease. This implementation causes a nullptr exception, if the user space program has several threads accessing the same socket descriptor. Consider two threads of the same program performing the following: Thread1 Thread2 -------------------- ---------------------- Enter tipc_sendmsg() Enter tipc_sendmsg() lock_sock() lock_sock() Enter tipc_link_xmit(), ret=ELINKCONG spin on socket lock.. sk_wait_event() : release_sock() grab socket lock : Enter tipc_link_xmit(), ret=0 : release_sock() Wakeup after congestion lock_sock() skb = skb_peek(pktchain); !! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong; In this case, the second thread transmits the buffers belonging to both thread1 and thread2 successfully. When the first thread wakeup after the congestion it assumes that the pktchain is intact and operates on the skb's in it, which leads to the following exception: [2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0 [2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc] [2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0 [2102.440074] Oops: 0000 [#1] SMP [2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1 [2102.440074] RIP: 0010:[<ffffffffa005f330>] [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc] [...] [2102.440074] Call Trace: [2102.440074] [<ffffffff8163f0b9>] ? schedule+0x29/0x70 [2102.440074] [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc] [2102.440074] [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc] [2102.440074] [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc] [2102.440074] [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20 [2102.440074] [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc] [2102.440074] [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0 [2102.440074] [<ffffffff81507895>] ? release_sock+0x145/0x170 [2102.440074] [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0 [2102.440074] [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10 [2102.440074] [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0 [2102.440074] [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0 [2102.440074] [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20 [2102.440074] [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0 [2102.440074] [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950 [2102.440074] [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80 [2102.440074] [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20 [2102.440074] [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b In this commit, we maintain the skb list always in the stack. Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>