summaryrefslogtreecommitdiffstats
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)Author
2021-01-21habanalabs: disable FW events on device removalOded Gabbay
When device is removed, we need to make sure the F/W won't send us any more events because during the remove process we disable the interrupts. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21habanalabs: fix backward compatibility of idle checkOded Gabbay
Need to take the lower 32 bits of the driver's 64-bit idle mask and put it in the legacy 32-bit variable that the userspace reads to know the idle mask. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21habanalabs: zero pci counters packet before submit to FWOfir Bitton
Driver does not zero some pci counters packets before sending to FW. This causes an out of sync PI/CI between driver and FW. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12habanalabs: prevent soft lockup during unmapOded Gabbay
When using Deep learning framework such as tensorflow or pytorch, there are tens of thousands of host memory mappings. When the user frees all those mappings at the same time, the process of unmapping and unpinning them can take a long time, which may cause a soft lockup bug. To prevent this, we need to free the core to do other things during the unmapping process. For now, we chose to do it every 32K unmappings (each unmap is a single 4K page). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12habanalabs: fix reset process in case of failuresOded Gabbay
There are some points in the reset process where if the code fails for some reason, and the system admin tries to initiate the reset process again we will get a kernel panic. This is because there aren't any protections in different fini functions that are called during the reset process. The protections that are added in this patch make sure that if the fini functions are called multiple times, without calling init functions between them, there won't be double release of already released resources. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12habanalabs: fix dma_addr passed to dma_mmap_coherentOded Gabbay
When doing dma_alloc_coherent in the driver, we add a certain hard-coded offset to the DMA address before returning to the callee function. This offset is needed when our device use this DMA address to perform outbound transactions to the host. However, if we want to map the DMA'able memory to the user via dma_mmap_coherent(), we need to pass the original dma address, without this offset. Otherwise, we will get erronouos mapping. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-29habanalabs: Fix memleak in hl_device_resetDinghao Liu
When kzalloc() fails, we should execute hl_mmu_fini() to release the MMU module. It's the same when hl_ctx_init() fails. Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: fix order of status checkOded Gabbay
When the device is in reset or needs to be reset, the disabled property is don't-care. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: register to pci shutdown callbackOded Gabbay
We need to make sure our device is idle when rebooting a virtual machine. This is done in the driver level. The firmware will later handle FLR but we want to be extra safe and stop the devices until the FLR is handled. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: add validation cs counter, fix misplaced countersAlon Mizrahi
Up until now validation errors were counted in the parsing field of the cs_counters struct, so we added a new counter and increased it when needed. In addition, there were some locations where only one of the counters was updated (ctx or aggregate) so add the second one to be updated as well. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs/gaudi: retry loading TPC f/w on -EINTROded Gabbay
If loading the firmware file for the TPC f/w was interrupted, try to do it again, up to 5 times. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: adjust pci controller init to new firmwareOded Gabbay
When the firmware security is enabled, the pcie_aux_dbi_reg_addr register in the PCI controller is blocked. Therefore, ignore the result of writing to this register and assume it worked. Also remove the prints on errors in the internal ELBI write function. If the security is enabled, the firmware is responsible for setting this register correctly so we won't have any problem. If the security is disabled, the write will work (unless something is totally broken at the PCI level and then the whole sequence will fail). In addition, remove a write to register pcie_aux_dbi_reg_addr+4, which was never actually needed. Moreover, PCIE_DBI registers are blocked to access from host when firmware security is enabled. Use a different register to flush the writes. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: update comment in hl_boot_if.hOded Gabbay
Hard-reset flag is updated in many stages of the boot sequence of the firmware. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs/gaudi: enhance reset messageOded Gabbay
Print the initiator who performs the hard-reset for easier debugging. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: full FW hard reset supportOfir Bitton
Driver must fetch FW hard reset capability at every FW boot stage: preboot, CPU boot, CPU application. If hard reset is triggered, driver will take into consideration only the last capability received. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs/gaudi: disable CGM at HW initializationOded Gabbay
In case the clock gating was enabled in preboot we need to disable it at the H/W initialization stage before touching the MME/TPC registers. Otherwise, the ASIC can get stuck. If the security is enabled in the firmware level, the CGM is always disabled and the driver can't enable it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: Revise comment to align with mirror list nameTomer Tayar
hw_queues_mirror was renamed to cs_mirror, so revise accordingly a comment that refers to this list. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs/gaudi: do not set EB in collective slave queuesAlon Mizrahi
We don't need to set EB on signal packets from collective slave queues as it degrades performance. Because the slaves are the network queues, the engine barrier doesn't actually guarantee that the packet has been sent. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: preboot hard reset supportOfir Bitton
FW hard reset capability indication is now moved to preboot stage. Driver will check if HW is dirty only after it validated preboot is up. If HW is dirty, driver will perform a hard reset according to the FW capability. In addition, FW defines a new message which driver need to send in order to initiate a hard reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: remove generic gaudi get_pll_freq functionAlon Mizrahi
As we only fetch the CPU_PLL frequency in gaudi, we don't need a generic get_pll_frequency function which takes a pll index as input Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: fetch PSOC PLL frequency from F/W in goyaAlon Mizrahi
When the F/W security is enabled, goya needs to fetch the PSOC pll frequency through a dedicated interface Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28habanalabs: Fix a missing-braces warningTomer Tayar
Fix a compilation "missing braces around initializer" warning. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-07Merge 5.10-rc7 into char-misc-nextGreg Kroah-Hartman
We want the fixes in here, and this resolves a merge issue with drivers/misc/habanalabs/common/memory.c. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-30habanalabs: Add CB IOCTL opcode to retrieve CB informationTomer Tayar
Add a new CB IOCTL opcode that enables a user to query about a CB and get its usage count. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Modify the cs_cnt of a CB to be atomicTomer Tayar
Modify the CS counter of a CB to be atomic, so no locking is required when it is being modified or read. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Add mask for CS type bits in CS flagsTomer Tayar
hl_cs_sanity_checks() extracts the CS type bits of the CS flags, by masking out the non-type bits. To save the need for updating the function whenever new bits for non-type flags are added, add an explicit mask for the CS type bits. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: change messages to debug levelOded Gabbay
Some messages should be changed to debug mode as we want to keep minimal prints during normal operation of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: free host huge va_range if not usedOfir Bitton
If huge range is not valid, driver uses the host range also for huge page allocations, but driver never frees its allocation. This introduces a memory leak every time a user closes its context. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: handle reset when f/w is in prebootOded Gabbay
Currently, if the f/w is in preboot/u-boot they don't perform the new reset mechanism. Therefore, the driver needs to reset the device. To prevent reset of PCI_IF, the driver needs to first configure the reset units. If the security is enabled, the driver can't configure the reset units. In that situation, don't reset the card. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add missing counter updateOded Gabbay
The global CS drop-on-reset counter wasn't updated together with the context counter. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add ull to PLL masksAlon Mizrahi
These defines are 64-bit defines so they need ull suffix. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add support for cs with timestampOfir Bitton
add support for user to request a timestamp upon cs completion. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: indicate to user that a cs is goneOfir Bitton
We want to indicate to the user that a certain command submission is finished long time ago and it is no longer in database. This means no further information regarding this cs can be obtained. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: print ECC type fieldOded Gabbay
We have the ECC type field from the firmware but the driver didn't print it, so we need to add that field to the ECC print message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: update firmware filesOded Gabbay
Update various firmware header files with new defines. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: gaudi_ctx_fini() can be statickernel test robot
Make a function in gaudi.c to be static Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: goya_reset_sob_group() can be statickernel test robot
Make some functions static Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: kernel test robot <lkp@intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fetch pll frequency from firmwareAlon Mizrahi
Once firmware security is enabled, driver must fetch pll frequencies through the firmware message interface instead of reading the registers directly. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: mmu map wrapper for sizes larger than a pageOfir Bitton
We introduce a new wrapper which allows us to mmu map any size to any host va_range available. In addition we remove duplicated code from various places in driver and using this new wrapper instead. This wrapper supports mapping only contiguous physical memory blocks and will be used for mappings that are done to the driver ASID. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: print CS type when it is stuckOded Gabbay
We have several types of command submissions and the user wants to know which type of command submission has not finished in time when that event occurs. This is very helpful for debug. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: align to new FW reset schemeOfir Bitton
As part of the security effort in which FW will be handling sensitive HW registers, hard reset flow will be done by FW and will be triggered by driver. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: firmware returns 64bit argumentAlon Mizrahi
F/W message returns 64bit value but up until now we casted it to a 32bit variable, instead of receiving 64bit in the first place. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fix MMU debugfs operationsMoti Haimovski
After the MMU-code refactoring, the existing MMU debugfs operations are no longer working so we need to fix them. In addition, remove the duplicate code that was in the debugfs code and use the already existing MMU-code. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: share a single ctx-mutex between all MMUsMoti Haimovski
Multiple locks are usually a source of problems, which in the MMU case can be avoided since it is relatively rare that both MMU tables are updated at the same time. Therefore, use a single shared lock instead of two separate ones. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: support reserving aligned va blockOfir Bitton
Add support for reserving va block with alignment different than page size. This is a pre-requisite for allocations needed in future ASICs Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add boot errors printsGuy Nisan
Add log prints for security and eFuse boot error bits Signed-off-by: Guy Nisan <gnisan@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: print message with correct deviceOded Gabbay
During hard-reset, the driver rejects further IOCTL calls and prints an error message. That error message should be printed with the correct device instead of using only the control device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: fetch HBM ecc info from FWOfir Bitton
Once FW security is enabled there is no access to HBM ecc registers, need to read values from FW using a dedicated interface. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fetch hard reset capability from FWOfir Bitton
Driver must fetch FW hard reset capability during boot time, in order to skip the hard reset flow if necessary. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: move asic property to correct structureOded Gabbay
Whether an ASIC has MMU towards its DRAM is an ASIC property, so move it to the asic fixed properties structure. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>