This paper describes the algorithm and design tradeoffs for multiple hardware implementations of parallel high-radix scalable Montgomery multipliers. Hardware implementations of Montgomery multipliers require choosing a radix, shift direction, and whether to use Booth encoding. Presented are processing element designs exploring combinations of radices 2, 4, and 8, right vs. left shifting, and Booth encoding. A radix-4, left-shifting, non-Booth encoded design performs a 1024-bit modular exponentiation in 9.4 ms using 4997 LUTs and 4051 REGs and appears to maximize performance/hardware in an FPGA implementation. A Booth encoded version of the above multiplier performs a 1024-bit modular exponentiation in 13 ms using 4852 LUTs and 2887 REGs. This design may be beneficial for systems constrained by the cycle time of other elements because the design minimizes hardware usage and requires no precomputed multiples. The radix-8, right-shifting, Booth-encoded design offers no performance/hardware advantage over a comparable radix-4 design.