Last visit was: Thu May 01, 2025 11:35 pm
|
It is currently Thu May 01, 2025 11:35 pm
|
Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I was just wondering why multiply and divide operations aren't usually made interruptible ? It seems like they would make them interruptible to reduce the interrupt latency. All it would take is storing the intermediate state when an interrupt occurs then restoring it on the return.
_________________Robert Finch http://www.finitron.ca
|
Thu Sep 01, 2016 10:16 am |
|
 |
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Rob: When I looked into implementing an interruptable block move instruction for my core, I found that the internal state that I would need to preserve in order to correctly restart the instruction was quite extensive. So to conserve logic resources, I opted to provide a way for the programmer to control the interrupt latency the using the block move instruction would introduce. Thus, there is a single transfer mode where the block move instruction is interruptable after every transfer. The penalty for using this mode is that a conditional branch must be used by the programmer to continue the transfer. The benefit of this mode is that no special internal structures or microcode sequences are needed to provide low latency interrupt handling. The block move instruction also operates in an uninterruptable manner. In this mode, the length of the transfer will determine the interrupt latency. The programmer can choose to use block size that provides the interrupt latency he/she/it  is willing to tolerate. Thus, for my core to support interruptable, high cycle count instructions a significant increase in the complexity of the control microprogram and internal structures would have been required to hold the processor state to allow the instruction to be restarted following the return from the ISR. Depending on the multiplier/divider (floating point or integer), I can see a similar problem arising. One solution I would consider would be to partition the instructions into separate interruptable, low latency instructions that only accomplish a portion of the whole uninterruptable instruction, so that the multiplication/division process is interruptable at clearly defined points, and easily supported by the processor interrupt service process.
_________________ Michael A.
|
Thu Sep 01, 2016 11:42 am |
|
 |
Garth
Joined: Tue Dec 11, 2012 8:03 am Posts: 285 Location: California
|
I have not sharpened my pencil to look into the details of a hardware divide, but multiply can be done without any clocking, so the output is formed just as fast as the propagation delays allow. It may take longer than a single system clock cycle, but not dozens. I'm not sure there's any quick way to do a divide though. A possibility to look into is whether you can invert (so it's always specifically 1 divided by...), and then multiply by that inverse. That's what my large look-up tables for 16-bit scaled integer math are geared for. Each of the 65,536 inputs gives a pre-calculated 32-bit output, so you can use the bytes you want without losing accuracy at either end. Invert, then use other tables to speed up the multiply. If speed and low interrupt latency is important, is it feasible to make the hardware do some factoring before starting the divide? Or how about using different instructions, or at least different paths, for different precisions, so you're not going for maximum precision when you don't need to? I think everyone here knows how the block-move instructions work on the 65816. It's interruptible, because the instruction repeats itself for every byte it does, re-fetching itself and the source and destination banks every time, so it takes seven clocks per byte. So it's kind of slow, but because of the way I use interrupts sometimes (for audio sampling), I like the fact that no instruction is more than 9 clocks. 20 or 50 or 100 or 1000 would not be acceptable.
_________________http://WilsonMinesCo.com/ lots of 6502 resources
|
Thu Sep 01, 2016 8:15 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: I have not sharpened my pencil to look into the details of a hardware divide, but multiply can be done without any clocking It still takes several clocks cycles for a multiply to complete at a high clock frequency compared to other logic. Multi-cycle paths have to be specified with appropriate waiting in the pipeline. The multiply is one of the things that shows up sometimes as being on the critical path for timing. I've broken the multiply instruction up into several partial products sums that take five or six cycles to complete (using multipliers in DSP slices). It gets tricky, I'd prefer to go with a bit simpler bit-by-bit multiplier. Divide can be done the same way a multiply is without clocking, but it is much slower (I'd say 4 to 5 times) because of the conditional logic and additional multiplexing. It's so much slower that it's maybe faster to do a clocked Newton's divide. I've got a divider that processes four bits at a time using cascaded logic completing in a about 10 clock cycles for a really slow 32 bit core. If you don't have to worry about the clock frequency a single cycle divider would be fine. Quote: I think everyone here knows how the block-move instructions work on the 65816.... It's one thing that could be improved on the 65816. It could have an interruptible block move without re-fetching the instruction every time. It could be reduced probably to three or four cycles. The hardware's a bit trickier to implement because the interrupt status would have to be tested in an additional location. Quote: That's what my large look-up tables for 16-bit scaled integer math are geared for... Lookup tables are an awesome way to do some of the mathematics very fast. They get used in hardware in place of arithmetic functions like divide. I used a reciprocal lookup table and a multiply instead of a divide for calculating character positions in a text controller. A software routine like the following could be used to perform a multiply. It has a couple of oddball instructions to speed things up. ADDOD adds conditionally based on whether or not the LSB of a register is set. Avoids branching in the multiply. It only take six clocks per bit to do the multiply in software. Code: ; computes signed product of r1 and r2 and returns it in ; r3(low) and r4(high) ; push r5,r6,r10 mov r3,r0 ; zero out product mov r4,r0 ldi r6,#64 ; 64 bits to process eor r10,r1,r2 ; compute sign of result abs r1,r1 ; take absolute value of operand abs r2,r2 j1: ; M1 shl r5,r4,#63 ; shift LSB to MSB shr r4,r4,#1 ; shift product to right shr r3,r3,#1 ; or r3,r3,r5 ; set MSB of LSW addod r4,r4,r2,r1 ; add r2 to r4 if r1 is odd shr r1,r1,#1 ; shift op over to next bit ; M2 add r6,r6,#-1 ; decrement loop count bne r6,r0,j1
bge r10,r0,j2 ; to make negative, take one's complement ; then increment. eor r5,r3,#-1 ; take one's complement eor r6,r4,#-1 add r3,r5,#1 ; increment LSW caci r10,r3,r5,#1 ; compute carry from addition add r4,r6,r10 ; add carry to MSW j2: pop r10 pop r6 pop r5 rts
_________________Robert Finch http://www.finitron.ca
|
Thu Sep 01, 2016 10:50 pm |
|
Who is online |
Users browsing this forum: claudebot and 0 guests |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot post attachments in this forum
|
|