This paper describes a parallelized radix-4 scalable Montgomery multiplier implementation. The design does not require hardware multipliers and uses parallelized multiplication to shorten the critical path. By left-shifting the source operands rather than right-shifting the result, the latency between processing elements is reduced from two cycles to nearly one. The new design performs 1024-bit modular exponentiation in 8.7 ms and 256-bit modular exponentiation in 0.36 ms using 5916 Virtex2 4-input lookup tables. This is comparable to radix-2 designs for long multiplications and nearly twice as fast for short ones.
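For readers unfamiliar with the underlying arithmetic, the following is a minimal software sketch of radix-4 Montgomery multiplication, which consumes two bits of one operand per iteration. It is an integer-level illustration only and does not model the word-serial scalable architecture, the processing-element pipeline, or the left-shifting scheme described above. The function name `montgomery_radix4` and its parameters are hypothetical, chosen for this example.

```python
def montgomery_radix4(x, y, m, n_bits):
    """Return x * y * 2**(-n_bits) mod m using radix-4 digit steps.

    Assumptions (illustrative, not taken from the paper):
      - m is odd, and 0 <= x, y < m < 2**n_bits
      - n_bits is even, so the operand splits into n_bits // 2 radix-4 digits
    Requires Python 3.8+ for pow(m, -1, 4).
    """
    if m % 2 == 0 or n_bits % 2 != 0:
        raise ValueError("m must be odd and n_bits must be even")

    # m_prime = -m^(-1) mod 4, used to pick the multiple of m that
    # clears the two low bits of the running sum.
    m_prime = (-pow(m, -1, 4)) % 4

    z = 0
    for i in range(0, n_bits, 2):
        x_digit = (x >> i) & 3        # next radix-4 digit of x (0..3)
        z += x_digit * y              # in hardware: a selected multiple of y
        q = (z * m_prime) & 3         # quotient digit making z + q*m divisible by 4
        z += q * m
        z >>= 2                       # exact division by the radix

    # The loop keeps z below 2*m, so one conditional subtraction suffices.
    if z >= m:
        z -= m
    return z
```

The multiplications by `x_digit` and `q` only ever use factors 0 to 3, so a hardware version can replace them with shifts, additions, and selection among precomputed multiples, which is consistent with a design that needs no dedicated multipliers. A result in the ordinary domain can be checked against `(x * y * pow(2**n_bits, -1, m)) % m`.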