Well, what you describe is pretty much how Alpha does it. MIPS does it differently.
Alpha calls this instruction ldq_u (u for unaligned).
Alpha uses three instructions for that: two extq instructions for shifting and masking r1 and r2, and an or instruction to combine the results. Overall an unaligned load looks like this on Alpha:
  lda    at,0(t0)
  ldq_u  t9,0(at)
  ldq_u  t10,7(at)
  extql  t9,at,t9
  extqh  t10,at,t10
  or     t9,t10,t3
The lda (for computing the effective address) could be optimized away in nearly all cases, but that effort was apparently not expended by gas. It is interesting that the offset for the second ldq_u is 7, not 8 (and the extqh must match that). My guess is that this is done so that you do not get an exception when you use this sequence to load the last word of a page at an aligned address.

Hmm, this requires two instructions that AFAIK are used only for this purpose: extqh and extql (ldq_u is also used for byte loads etc. on the Alpha). How much longer would the sequence be if we allowed only one 2-in-1-out special-purpose instruction, or none (but slightly more general-purpose shift-and-mask-byte instructions)?
I can see how to do it with one less instruction while still using two special-purpose instructions: extqh does not need to set the low-order byte (extql covers it in every case), so it could store the low-order bits of the address there. extql could then be modified to take the result of extqh instead of the address, and to perform the merge as well. The sequence would look like:
  lda     at,0(t0)      # can be optimized away
  ldq_u   t9,0(at)
  ldq_u   t10,7(at)
  extqhx  t10,at,t10
  extqlor t9,t10,t3
This probably would have required additional muxes in the data path, though.
Followups set to comp.arch.
- anton