You might look at the MD5CRK project, http://www.md5crk.com. They have some optimized clients for MD5, and SHA is pretty similar in structure, so much of the same optimization work should carry over. See in particular http://www.md5crk.com/?sec=aboutmd5client, which indicates that their client will use the MMX and SSE units, as well as dual-processor systems. It actually tries several different versions, self-benchmarks them, and then chooses the fastest one. I don't know if they have source code available, but if you explained your goals they might be willing to help you out.
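
The "try several versions, benchmark, keep the fastest" idea can be sketched roughly as follows. This is not the MD5CRK client's code; the function names (`sha1_one_shot`, `sha1_chunked`, `pick_fastest`) are made up for illustration, and the two candidates are just two ways of driving Python's `hashlib`, standing in for what would really be plain C, MMX, and SSE builds of the same hash.

```python
import hashlib
import time


def sha1_one_shot(data: bytes) -> bytes:
    """Hypothetical candidate 1: hash the whole buffer in one call."""
    return hashlib.sha1(data).digest()


def sha1_chunked(data: bytes, chunk_size: int = 4096) -> bytes:
    """Hypothetical candidate 2: feed the buffer in fixed-size chunks."""
    h = hashlib.sha1()
    view = memoryview(data)
    for i in range(0, len(view), chunk_size):
        h.update(view[i:i + chunk_size])
    return h.digest()


# Stand-ins for the differently optimized builds a real client would ship.
CANDIDATES = {"one_shot": sha1_one_shot, "chunked": sha1_chunked}


def _time_once(fn, sample: bytes) -> float:
    start = time.perf_counter()
    fn(sample)
    return time.perf_counter() - start


def pick_fastest(candidates, sample: bytes, rounds: int = 5) -> str:
    """Return the name of the fastest candidate that gives the right digest."""
    reference = hashlib.sha1(sample).digest()
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Never select a version that computes the wrong answer.
        if fn(sample) != reference:
            continue
        # Take the best of several runs to reduce timing noise.
        elapsed = min(_time_once(fn, sample) for _ in range(rounds))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    if best_name is None:
        raise RuntimeError("no candidate produced a correct digest")
    return best_name


if __name__ == "__main__":
    sample = bytes(range(256)) * 4096  # about 1 MB of test input
    print("fastest on this machine:", pick_fastest(CANDIDATES, sample))
```

Checking each candidate against a known-good digest before timing it matters: a version that relies on an instruction set the CPU handles badly should lose the benchmark, but one that gives wrong output should never be picked at all.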