Optimized parallel CUDA scan algorithm to avoid unnecessary writing in the first phase
The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
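
The trade-off can be illustrated with a minimal host-side sketch in C++ (not CUDA; blocks are processed sequentially). It assumes the optimized variant is a "reduce, then scan" scheme, which the commit message does not confirm. All names here (`ScanStats`, `scan_prescan_shift`, `scan_reduce_then_scan`, `block_offsets`) are hypothetical and do not come from the actual code base.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical counters used to compare the two strategies.
struct ScanStats {
    std::size_t input_evals = 0;    // how often the input expression is evaluated
    std::size_t output_writes = 0;  // how often an output element is written
};

using Input = std::function<long(std::size_t)>;

// Exclusive scan of the per-block totals: offsets[b] = sum of totals[0..b-1].
static std::vector<long> block_offsets(const std::vector<long>& totals) {
    std::vector<long> offsets(totals.size(), 0);
    for (std::size_t b = 1; b < totals.size(); ++b)
        offsets[b] = offsets[b - 1] + totals[b - 1];
    return offsets;
}

// Strategy A (prescan + uniform shift): phase 1 scans each block locally and
// writes the partial results; phase 2 adds the block offset to every element.
// Each input is evaluated once, each output is written twice.
// Assumes block > 0.
std::vector<long> scan_prescan_shift(const Input& input, std::size_t n,
                                     std::size_t block, ScanStats& stats) {
    std::vector<long> out(n);
    const std::size_t blocks = (n + block - 1) / block;
    std::vector<long> totals(blocks, 0);
    for (std::size_t b = 0; b < blocks; ++b) {
        long running = 0;
        const std::size_t end = std::min((b + 1) * block, n);
        for (std::size_t i = b * block; i < end; ++i) {
            running += input(i);
            ++stats.input_evals;
            out[i] = running;
            ++stats.output_writes;
        }
        totals[b] = running;
    }
    const std::vector<long> offsets = block_offsets(totals);
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t end = std::min((b + 1) * block, n);
        for (std::size_t i = b * block; i < end; ++i) {
            out[i] += offsets[b];
            ++stats.output_writes;
        }
    }
    return out;
}

// Strategy B (assumed optimized variant): phase 1 only reduces each block to
// its total and writes nothing to the output; phase 2 scans each block starting
// from its offset. Each input is evaluated twice, each output is written once.
// Assumes block > 0.
std::vector<long> scan_reduce_then_scan(const Input& input, std::size_t n,
                                        std::size_t block, ScanStats& stats) {
    std::vector<long> out(n);
    const std::size_t blocks = (n + block - 1) / block;
    std::vector<long> totals(blocks, 0);
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t end = std::min((b + 1) * block, n);
        for (std::size_t i = b * block; i < end; ++i) {
            totals[b] += input(i);
            ++stats.input_evals;
        }
    }
    const std::vector<long> offsets = block_offsets(totals);
    for (std::size_t b = 0; b < blocks; ++b) {
        long running = offsets[b];
        const std::size_t end = std::min((b + 1) * block, n);
        for (std::size_t i = b * block; i < end; ++i) {
            running += input(i);
            ++stats.input_evals;
            out[i] = running;
            ++stats.output_writes;
        }
    }
    return out;
}
```

Under these assumptions, both functions return the same inclusive scan, but strategy A trades an extra pass of output writes for a single evaluation of each input, while strategy B does the opposite. That is why the first form can still win when evaluating an input element is costly, as with a lazily evaluated vector expression.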