www.riscos.com Technical Support:
Acorn Assembler


Example assembler fragments

The following example assembly language fragments show ways in which the basic ARM instructions can be combined to give efficient code. None of the techniques illustrated saves a great deal of execution time (although they all save some); mostly they just save code.

Note that, when optimising code for execution speed, the different hardware on which it may run should be considered. Some changes which speed up code on one machine may slow it down on another. An example is unrolling loops (e.g. divide loops), which speeds execution on an ARM2 but can slow execution on an ARM3, which has a cache.

Using the conditional instructions

Using conditionals for logical OR

        CMP     Rn,#p           ; IF Rn=p OR Rm=q THEN GOTO Label
        BEQ     Label
        CMP     Rm,#q
        BEQ     Label

can be replaced by:

        CMP     Rn,#p
        CMPNE   Rm,#q           ; If condition not satisfied try
        BEQ     Label           ;  another test.

Absolute value

        TEQ     Rn,#0           ; Test sign
        RSBMI   Rn,Rn,#0        ; and 2's complement if necessary.

Combining discrete and range tests

        TEQ     Rc,#127         ; discrete test
        CMPNE   Rc,#" "-1       ; range test
        MOVLS   Rc,#"."         ; IF Rc<#" " OR Rc=CHR$127 THEN Rc:="."
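
As a rough C model of the fragment above (a sketch only; the function name printable_or_dot is made up for illustration), the range test relies on an unsigned comparison against " "-1, and the discrete test is folded into the same final condition:

```c
#include <stdint.h>

/* Hypothetical helper mirroring TEQ / CMPNE / MOVLS.
   Replaces DEL (127) and anything below space with '.'. */
uint32_t printable_or_dot(uint32_t c)
{
    /* MOVLS executes on "unsigned lower or same": either the TEQ
       found equality (Z set), or the CMPNE found c <= ' ' - 1. */
    if (c == 127u || c <= (uint32_t)(' ' - 1)) {
        c = (uint32_t)'.';
    }
    return c;
}
```

For instance, printable_or_dot(31) and printable_or_dot(127) both give ".", while a space or a letter is returned unchanged.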

Division and remainder

; Enter with dividend in Ra, divisor in Rb.
; Divisor must not be zero.
        MOV     Rd,Rb           ; Put the divisor in Rd.
        CMP     Rd,Ra,LSR #1    ; Then double it until
Div1    MOVLS   Rd,Rd,LSL #1    ; 2 * Rd > dividend.
        CMP     Rd,Ra,LSR #1
        BLS     Div1
        MOV     Rc,#0           ; Initialise the quotient
Div2    CMP     Ra,Rd           ; Can we subtract Rd?
        SUBCS   Ra,Ra,Rd        ; If we can, do so
        ADC     Rc,Rc,Rc        ; Double quotient and add new bit
        MOV     Rd,Rd,LSR #1    ; Halve Rd.
        CMP     Rd,Rb           ; And loop until we've gone
        BHS     Div2            ; past the original divisor.
; Now Ra holds remainder, Rb holds original divisor,
; Rc holds quotient and Rd holds junk.
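
The two loops above can be followed more easily in a C model. The sketch below is illustrative only: the names udivmod32 and struct divmod are assumptions, not part of the original, and the carry flag is modelled by an ordinary comparison.

```c
#include <stdint.h>

/* Hypothetical result type: quotient (Rc) and remainder (Ra). */
struct divmod {
    uint32_t quot;
    uint32_t rem;
};

/* Model of the Div1/Div2 loops. The divisor rb must not be zero. */
struct divmod udivmod32(uint32_t ra, uint32_t rb)
{
    struct divmod result;
    uint32_t rd = rb;                /* MOV Rd,Rb */
    uint32_t rc = 0;                 /* MOV Rc,#0 */

    /* Div1: double Rd while Rd <= Ra / 2 */
    while (rd <= (ra >> 1)) {
        rd <<= 1;
    }

    /* Div2: one quotient bit per pass */
    do {
        uint32_t carry = (ra >= rd); /* CMP Ra,Rd sets C if Ra >= Rd */
        if (carry) {
            ra -= rd;                /* SUBCS Ra,Ra,Rd */
        }
        rc = rc + rc + carry;        /* ADC Rc,Rc,Rc */
        rd >>= 1;                    /* MOV Rd,Rd,LSR #1 */
    } while (rd >= rb);              /* CMP Rd,Rb / BHS Div2 */

    result.quot = rc;
    result.rem = ra;
    return result;
}
```

With a dividend of 100 and a divisor of 7, Rd is doubled to 56 by the first loop, and the second loop then produces the quotient bits 1, 1, 1, 0 (quotient 14) with remainder 2.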

Pseudo-random binary sequence generator

It is often necessary to generate (pseudo-) random numbers, and the most efficient algorithms are based on shift generators with feedback, rather like a cyclic redundancy check generator. Unfortunately, a 32 bit generator needs more than one feedback tap to produce a maximal length sequence (that is, 2^32-1 cycles before repetition), so a 33 bit shift generator with taps at bits 20 and 33 is used instead.

The basic algorithm is:

  • new bit := bit 33 EOR bit 20
  • shift left the 33 bit number
  • put in new bit at the bottom.
  • Repeat for all the 32 new bits needed.

All this can be done in five S cycles:

; Enter with seed in Ra (32 bits), Rb (1 bit in Rb lsb)
; Uses Rc
        TST     Rb,Rb,LSR #1    ; top bit into carry
        MOVS    Rc,Ra,RRX       ; 33 bit rotate right
        ADC     Rb,Rb,Rb        ; carry into lsb of Rb
        EOR     Rc,Rc,Ra,LSL#12 ; (involved!)
        EOR     Ra,Rc,Rc,LSR#20 ; (similarly involved!)
; New seed in Ra, Rb as before
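
To see why the five instructions are equivalent to 32 passes of the basic algorithm, it can help to model both in C and compare them. The sketch below is a model under stated assumptions, not a definitive implementation: the 33 bit state is packed into a 64-bit integer with the Rb bit as bit 32 (bit 33 in the numbering used above) and Ra as bits 0 to 31, and both function names are invented for illustration.

```c
#include <stdint.h>

/* Basic algorithm, one bit at a time, repeated 32 times.
   Bits are numbered from 1 in the text, so bit 33 is (s >> 32)
   and bit 20 is (s >> 19). */
uint64_t prbs32_bitwise(uint64_t s)
{
    int i;
    for (i = 0; i < 32; i++) {
        uint64_t newbit = ((s >> 32) ^ (s >> 19)) & 1u;
        s = ((s << 1) | newbit) & 0x1FFFFFFFFULL;
    }
    return s;
}

/* Model of the five-instruction sequence. */
uint64_t prbs32_fast(uint64_t s)
{
    uint32_t ra = (uint32_t)s;
    uint32_t rb = (uint32_t)(s >> 32) & 1u;
    uint32_t carry, rc;

    carry = rb & 1u;                 /* TST  Rb,Rb,LSR #1    */
    rc = (carry << 31) | (ra >> 1);  /* MOVS Rc,Ra,RRX       */
    carry = ra & 1u;                 /*  (carry out = Ra bit 0) */
    rb = rb + rb + carry;            /* ADC  Rb,Rb,Rb        */
    rc ^= ra << 12;                  /* EOR  Rc,Rc,Ra,LSL#12 */
    ra = rc ^ (rc >> 20);            /* EOR  Ra,Rc,Rc,LSR#20 */

    /* Only the lsb of Rb is significant to the next call. */
    return ((uint64_t)(rb & 1u) << 32) | ra;
}
```

Starting from Ra = 1 and Rb = 0, both functions give Ra = &1000 with the Rb bit set. The first EOR forms the twenty new bits that depend only on the old state, and the second forms the remaining twelve, which depend on new bits already generated.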

Multiplication by a constant

Multiplication by 2^n (1,2,4,8,16,32...)

        MOV     Ra,Ra,LSL #n

Multiplication by 2^n+1 (3,5,9,17...)

        ADD     Ra,Ra,Ra,LSL #n

Multiplication by 2^n-1 (3,7,15...)

        RSB     Ra,Ra,Ra,LSL #n

Multiplication by 6

        ADD     Ra,Ra,Ra,LSL #1 ; Multiply by 3
        MOV     Ra,Ra,LSL #1    ; and then by 2.

Multiply by 10 and add in extra number

        ADD     Ra,Ra,Ra,LSL #2 ; Multiply by 5
        ADD     Ra,Rc,Ra,LSL #1 ; Multiply by 2 and add in next digit
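
A typical use of this pair of instructions is reading a decimal number one digit at a time. The following C sketch shows the idea; the function name parse_decimal is hypothetical, and overflow of the 32 bit accumulator is not checked.

```c
#include <stdint.h>

/* Hypothetical example: accumulate decimal digits from a string
   using the same two shift-and-add steps as the fragment above. */
uint32_t parse_decimal(const char *s)
{
    uint32_t ra = 0;                         /* running total */

    while (*s >= '0' && *s <= '9') {
        uint32_t rc = (uint32_t)(*s - '0');  /* next digit */
        ra = ra + (ra << 2);                 /* ADD Ra,Ra,Ra,LSL #2 : times 5 */
        ra = rc + (ra << 1);                 /* ADD Ra,Rc,Ra,LSL #1 : times 2, plus digit */
        s++;
    }
    return ra;
}
```

Parsing stops at the first character that is not a decimal digit, so parse_decimal("42x") yields 42.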

General recursive method for Rb := Ra×C, C a constant

If C even, say C = 2^n×D, D odd:

D=1 :   MOV     Rb,Ra,LSL #n
D<>1:   {Rb := Ra*D}
        MOV     Rb,Rb,LSL #n

If C MOD 4 = 1, say C = 2^n×D+1, D odd, n>1:

D=1 :   ADD     Rb,Ra,Ra,LSL #n
D<>1:   {Rb := Ra*D}
        ADD     Rb,Ra,Rb,LSL #n

If C MOD 4 = 3, say C = 2^n×D-1, D odd, n>1:

D=1 :   RSB     Rb,Ra,Ra,LSL #n
D<>1:   {Rb := Ra*D}
        RSB     Rb,Ra,Rb,LSL #n

This is not quite optimal, but close. An example of its non-optimal use is multiply by 45 which is done by:

        RSB     Rb,Ra,Ra,LSL #2 ; Multiply by 3
        RSB     Rb,Ra,Rb,LSL #2 ; Multiply by 4*3-1 = 11
        ADD     Rb,Ra,Rb,LSL #2 ; Multiply by 4*11+1 = 45

rather than by:

        ADD     Rb,Ra,Ra,LSL #3 ; Multiply by 9
        ADD     Rb,Rb,Rb,LSL #2 ; Multiply by 5*9 = 45
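
The recursive method can be checked with a small C model. This is a sketch under stated assumptions rather than a code generator: the names split_pow2, mul_by_const and mul_const_ops are invented here, C = 1 is treated as a base case needing no instruction, and C is assumed to lie in the range 1 to 2^31-1.

```c
#include <stdint.h>

/* Split non-zero v into 2^n * d with d odd; returns d, stores n. */
static uint32_t split_pow2(uint32_t v, unsigned *n)
{
    *n = 0;
    while ((v & 1u) == 0u) {
        v >>= 1;
        (*n)++;
    }
    return v;
}

/* Computes ra * c (mod 2^32) using only the shift-and-add steps
   of the recursive method. Assumes 0 < c < 2^31. */
uint32_t mul_by_const(uint32_t ra, uint32_t c)
{
    unsigned n;
    uint32_t d, rb;

    if (c == 1u) {
        return ra;                       /* nothing to do */
    }
    if ((c & 1u) == 0u) {                /* C = 2^n * D */
        d = split_pow2(c, &n);
        rb = (d == 1u) ? ra : mul_by_const(ra, d);
        return rb << n;                  /* MOV Rb,Rb,LSL #n */
    }
    if ((c & 3u) == 1u) {                /* C = 2^n * D + 1 */
        d = split_pow2(c - 1u, &n);
        rb = (d == 1u) ? ra : mul_by_const(ra, d);
        return ra + (rb << n);           /* ADD Rb,Ra,Rb,LSL #n */
    }
    d = split_pow2(c + 1u, &n);          /* C = 2^n * D - 1 */
    rb = (d == 1u) ? ra : mul_by_const(ra, d);
    return (rb << n) - ra;               /* RSB Rb,Ra,Rb,LSL #n */
}

/* Number of instructions the method uses for constant c. */
unsigned mul_const_ops(uint32_t c)
{
    unsigned n;
    uint32_t d;

    if (c == 1u) {
        return 0u;
    }
    if ((c & 1u) == 0u) {
        d = split_pow2(c, &n);
    } else if ((c & 3u) == 1u) {
        d = split_pow2(c - 1u, &n);
    } else {
        d = split_pow2(c + 1u, &n);
    }
    return 1u + ((d == 1u) ? 0u : mul_const_ops(d));
}
```

Here mul_const_ops(45) returns 3, matching the three-instruction sequence given above for 45, whereas the hand-written alternative needs only two.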

Loading a word from an unknown alignment

There is no instruction to load a word from an unknown alignment. To do this requires some code (which can be a macro) along the following lines:

; Enter with 32-bit address in Ra
; Uses Rb, Rc; result in Rd
; Note: Rd must be a lower-numbered register than Rc, since LDMIA
;  loads the lowest-numbered register from the lowest address

        BIC     Rb,Ra,#3        ; Get word-aligned address
        LDMIA   Rb,{Rd,Rc}      ; Get 64 bits containing answer
        AND     Rb,Ra,#3        ; Correction factor in bytes
        MOVS    Rb,Rb,LSL #3    ; ...now in bits and test if aligned
        MOVNE   Rd,Rd,LSR Rb    ; If not aligned, produce bottom
                                ;  of result word
        RSBNE   Rb,Rb,#32       ; Get other shift amount
        ORRNE   Rd,Rd,Rc,LSL Rb ; Combine two halves to get result
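
The shifting and combining steps can be modelled in C as follows. This is only a sketch: the function name extract_unaligned is hypothetical, the two aligned words fetched by the LDMIA are passed in as arguments, and little-endian byte order is assumed.

```c
#include <stdint.h>

/* lo_word is the aligned word at (addr AND NOT 3), i.e. Rd after
   the LDMIA; hi_word is the following word, i.e. Rc. */
uint32_t extract_unaligned(uint32_t lo_word, uint32_t hi_word, uint32_t addr)
{
    uint32_t shift = (addr & 3u) << 3;   /* byte offset, in bits */

    if (shift != 0u) {
        /* Bottom of the result comes from the top of lo_word,
           top of the result from the bottom of hi_word. */
        lo_word = (lo_word >> shift) | (hi_word << (32u - shift));
    }
    return lo_word;
}
```

The test for zero matters in C as well as in the assembler version: with an aligned address the second shift amount would be 32, which C leaves undefined for a 32 bit operand.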

Sign/zero extension of a half word

        MOV     Ra,Ra,LSL #16   ; Move to top,
        MOV     Ra,Ra,LSR #16   ; and back to bottom
                                ; Use ASR instead of LSR to get the
                                ;  sign extended version
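
For comparison, the two results can be expressed in C as below. This is an illustrative sketch with made-up function names. Because a right shift of a negative signed value is implementation-defined in C, the sign extended version is written with an exclusive OR and a subtraction rather than a direct copy of the shift pair.

```c
#include <stdint.h>

/* Zero extension: same effect as LSL #16 followed by LSR #16. */
uint32_t zero_extend16(uint32_t ra)
{
    return (ra << 16) >> 16;
}

/* Sign extension: same bit pattern as LSL #16 followed by ASR #16,
   returned as an unsigned 32 bit value. */
uint32_t sign_extend16(uint32_t ra)
{
    uint32_t low = ra & 0xFFFFu;
    return (low ^ 0x8000u) - 0x8000u;
}
```

Given &1234ABCD, the zero extended result is &0000ABCD and the sign extended result is &FFFFABCD.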

Return setting condition codes

CFLAG   *        &20000000
        BICS    PC,R14,#CFLAG   ; Returns clearing C flag
                                ;  from link register
        ORRCCS  PC,R14,#CFLAG   ; Conditionally returns setting C flag

This code should not be used except in user mode, since it will reset the interrupt disable flags to the state which existed when R14 was set up. This rule applies generally to non-user mode programming.

For example in supervisor mode:

        MOV     PC,R14

is safer than

        MOVS    PC,R14

However, note that MOVS PC,R14 is required by the ARM Procedure Call Standard, used by code compiled from the high level language C. Such code, of course, runs in user mode.

Full multiply

The ARM's multiply instruction multiplies two 32 bit numbers together and produces the least significant 32 bits of the result. These 32 bits are the same regardless of whether the numbers are signed or unsigned.

To produce the full 64 bits of a product of two unsigned 32 bit numbers, the following code can be used:

; Enter with two unsigned numbers in Ra and Rb.
        MOVS    Rd,Ra,LSR #16           ; Rd is ms 16 bits of Ra
        BIC     Ra,Ra,Rd,LSL #16        ; Ra is ls 16 bits
        MOV     Re,Rb,LSR #16           ; Re is ms 16 bits of Rb
        BIC     Rb,Rb,Re,LSL #16        ; Rb is ls 16 bits
        MUL     Rc,Ra,Rb                ; Low partial product
        MUL     Rb,Rd,Rb                ; First middle partial product
        MUL     Ra,Re,Ra                ; Second middle partial product
        MULNE   Rd,Re,Rd                ; High partial product - NE
                                        ;  condition reduces time taken
                                        ;  if Rd is zero
        ADDS    Ra,Ra,Rb                ; Add middle partial products -
                                        ;  could not use MLA because we
                                        ;  need carry
        ADDCS   Rd,Rd,#&10000           ; Add carry into high partial
                                        ;  product
        ADDS    Rc,Rc,Ra,LSL #16        ; Add middle partial product
        ADC     Rd,Rd,Ra,LSR #16        ;  sum into low and high words
                                        ;  of result
; Now Rc holds the low word of the product, Rd its high word,
;  and Ra, Rb and Re hold junk.
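
The arithmetic can be checked against a C model that uses only 32 bit operations until the final result is assembled. The sketch below is an assumption-laden illustration, not a replacement for the code above: the function name mul32x32_64 is invented, and the carry flag is modelled by unsigned comparisons.

```c
#include <stdint.h>

/* Model of the 16 bit partial product method. Returns the 64 bit
   product with Rd as the high word and Rc as the low word. */
uint64_t mul32x32_64(uint32_t ra, uint32_t rb)
{
    uint32_t rc, rd, re, mid_shifted, carry;

    rd = ra >> 16;               /* ms 16 bits of Ra */
    ra &= 0xFFFFu;               /* ls 16 bits of Ra */
    re = rb >> 16;               /* ms 16 bits of Rb */
    rb &= 0xFFFFu;               /* ls 16 bits of Rb */

    rc = ra * rb;                /* low partial product */
    rb = rd * rb;                /* first middle partial product */
    ra = re * ra;                /* second middle partial product */
    rd = re * rd;                /* high partial product */

    ra += rb;                    /* ADDS Ra,Ra,Rb */
    if (ra < rb) {               /* carry out of the 32 bit add */
        rd += 0x10000u;          /* ADDCS Rd,Rd,#&10000 */
    }

    mid_shifted = ra << 16;
    rc += mid_shifted;           /* ADDS Rc,Rc,Ra,LSL #16 */
    carry = (rc < mid_shifted);
    rd += (ra >> 16) + carry;    /* ADC Rd,Rd,Ra,LSR #16 */

    return ((uint64_t)rd << 32) | rc;
}
```

The carry from adding the two middle partial products has weight 2^48 in the full product, which is why it is added to the high word as &10000.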

Of course, the ARM7M core provides the Multiply Long class of instructions to perform a 64 bit signed or unsigned multiply or multiply-accumulate (see Multiply Long and Multiply-Accumulate Long).

This edition Copyright © 3QD Developments Ltd 2015
Last Edit: Tue,03 Nov 2015