Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
rf386 Message-based interrupts (MSI, for message signalled interrupts) were added to the core and system. A message-based interrupt sends an interrupt message on the main bus rather than using dedicated interrupt signals. On the FTA bus the device sends the message on the response bus. There is no interrupt controller in the system. FTA MSI is fast: the device requesting an interrupt does not need to wait for an interrupt acknowledge cycle before placing information on the bus, because the information has already been sent to the CPU. The CPU constantly monitors the response bus for interrupt messages and places any it sees in a queue. Every time an instruction is fetched, the interrupt queue is popped if interrupts are enabled, and if there was an interrupt, interrupt processing begins. (The queue is also popped in a string op loop.)
An issue yet to be resolved is what to do if an interrupt message gets missed, for instance if interrupts were disabled for a long time. If a device has not had its interrupt source cleared within a reasonable length of time, something like 1000 CPU clock cycles, then it may need to resend the interrupt message.
There are only 32 entries in the interrupt queue. I was not sure how many to support. A queue using a single block RAM could probably support 1024 entries. But my gut tells me that if a whole bunch of interrupts are piling up then there is something wrong with the system. So another thing that could be done is to raise a fault if the interrupt queue becomes full.
Another issue is resolving interrupt priorities when there is no interrupt controller present. For now, the interrupt priority is stuffed into the high order 4 bits of the status register. These bits are not used by the i386. The interrupt popped from the queue has to beat the current interrupt level to be recognized. If the interrupt is not processed, it is placed back into the queue.
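The queue and priority rules described above can be sketched as a small software model. This is only an illustration, not the rf386 logic: the names `IrqMessage`, `InterruptQueue`, `snoop`, `poll` and `QueueFullFault` are hypothetical, and it assumes a larger priority number means a more urgent interrupt.

```python
from collections import deque
from dataclasses import dataclass

QUEUE_DEPTH = 32  # entries in the interrupt queue, per the post


class QueueFullFault(Exception):
    """Hypothetical fault raised when the queue overflows."""


@dataclass(frozen=True)
class IrqMessage:
    source: int    # hypothetical device identifier
    priority: int  # 0..15, fits the 4 status-register bits


class InterruptQueue:
    def __init__(self, depth=QUEUE_DEPTH):
        self.depth = depth
        self.entries = deque()

    def snoop(self, msg):
        """Called when an interrupt message is seen on the response bus."""
        if len(self.entries) >= self.depth:
            raise QueueFullFault("interrupt queue full")
        self.entries.append(msg)

    def poll(self, interrupts_enabled, current_level):
        """Called at instruction fetch; returns a message to service or None."""
        if not interrupts_enabled or not self.entries:
            return None
        msg = self.entries.popleft()
        if msg.priority > current_level:  # assumption: bigger number wins
            return msg
        self.entries.append(msg)  # not recognized: back into the queue
        return None
```

One consequence visible in the model is that a low-priority message cycles to the back of the queue each time it is popped and rejected, so a higher-priority message behind it is reached on a later fetch.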
ToDo: implement error codes for faults. Some faults cause an error code to be written to the stack. This is not being done yet.
_________________Robert Finch http://www.finitron.ca
|
Thu Aug 08, 2024 2:24 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
rf386 A lot of work getting CALLs and string ops working; still not finished, but better. Also had to increase the number of prefix slots to four, since there can be up to four prefixes for the rf386. Originally only two were supported. The cpu test program was updated. Many issues with call were due to incorrect mnemonics chosen for the operation. It is a learning curve getting used to std syntax.
ToDo: If a program is made with a continuous long stream of prefixes, then it prevents interrupts from being serviced. There should be a processor fault in that circumstance.
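A minimal sketch of the proposed check, assuming the fault should fire as soon as a run of prefixes exceeds the four prefix slots. The prefix byte values are the standard legacy x86 ones; `PrefixOverrunFault` and `count_prefixes` are hypothetical names, and the real core may pick a different limit or fault type.

```python
# Legacy x86 prefix bytes: operand/address size, LOCK, REPNE/REP,
# and the six segment overrides.
PREFIX_BYTES = frozenset({0x66, 0x67, 0xF0, 0xF2, 0xF3,
                          0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65})
MAX_PREFIXES = 4  # prefix slots in the rf386, per the post


class PrefixOverrunFault(Exception):
    """Hypothetical fault for an over-long run of prefixes."""


def count_prefixes(code, start=0, limit=MAX_PREFIXES):
    """Return the number of prefix bytes in front of the opcode at `start`."""
    n = 0
    while start + n < len(code) and code[start + n] in PREFIX_BYTES:
        n += 1
        if n > limit:
            raise PrefixOverrunFault(
                f"more than {limit} prefixes at offset {start}")
    return n
```

For example, `count_prefixes(bytes([0xF3, 0x66, 0xA5]))` returns 2 (REP, operand size, then the MOVS opcode), while five consecutive 0x66 bytes raise the fault.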
_________________Robert Finch http://www.finitron.ca
|
Fri Aug 09, 2024 4:30 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
rf386 Milestone: first far call instruction executed in protected mode.
Bug Fixes: ALU operations targeting memory were not updating memory.
Setting the CS:EIP is delayed until after the code segment descriptor is loaded for far transfers of execution. Both the CS and EIP are loaded at the same time. This is to avoid an intermittent invalid address generation which was confusing the instruction cache.
Bigfoot Native Mode Spec Shift instructions were made 24-bit opcodes. There is a restriction that only one of the first 16 registers can be used to hold a count value. Also shifts by immediate are limited to a count from 1 to 16. Compare instructions were given separate opcodes for signed, unsigned and float compares. They support 8,16,32, and 64 bit immediate operands. Other ALU operations support 6,14, and 30 bit immediate operands. Two extra bits were available for compares because the result goes to a condition register and there are only eight of those.
_________________Robert Finch http://www.finitron.ca
|
Sat Aug 10, 2024 4:26 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Bigfoot Made left shift a 24-bit instruction, and other shifts and rotates 32-bit instructions so they could share the primary opcode. Left shift is used the most often. Having a shorter opcode may help with code density. All shifts can now work with counts from 0 to 63 bits.
Added string operations like the x86 string ops. The string ops use only a 16-bit opcode, possible because the registers in use are assumed to be specific registers. The string compare (CMPS) and string scan (SCAS) may be more powerful than the x86’s. The comparison result is compared against the contents of a condition register for equality. This allows things like comparing all elements for greater than or less than. CMPS and SCAS also work with floating point values. There is no REP prefix. Repeat is indicated by a bit in the instruction as is direction.
Still a long ways to go on the spec yet, but getting there.
_________________Robert Finch http://www.finitron.ca
|
Sun Aug 11, 2024 2:22 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
rf386 Bug Fixes Page tables were set up improperly, leading to the wrong code being loaded and a crash.
For the pointer load instructions (LDS etc.), the descriptor was not being loaded in protected mode.
In the TLB, the address change detectors were looking at too many bits. Only a change in the page needed to be detected. This was an efficiency issue: the TLB still worked but detected too many changes.
I thought there was an issue with the MMU. When crossing an 8kB boundary (the size of an MMU page), the cache indicated a hit when there should not have been one. Data was being referenced from the original 8k, wrapping around and leading to incorrect instruction fetches. It turned out to be an off-by-one bit issue in the cache tag comparison; nothing to do with the MMU.
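Both address-bit fixes can be illustrated with a few lines of arithmetic. The 8 kB page size is from the post; the cache geometry (64-byte lines, 128 sets) is purely an assumption chosen so that the cache spans 8 kB, and the function names are hypothetical.

```python
PAGE_SHIFT = 13  # 8 kB MMU page, per the post

# Assumed cache geometry, for illustration only.
LINE_BITS = 6    # 64-byte line
INDEX_BITS = 7   # 128 sets, so index + offset cover 8 kB


def page_changed(prev_addr, addr):
    """Change detector that looks only at the page number bits."""
    return (prev_addr >> PAGE_SHIFT) != (addr >> PAGE_SHIFT)


def tag_of(addr):
    """Tag = every address bit above the index and line-offset fields."""
    return addr >> (INDEX_BITS + LINE_BITS)


def buggy_tag_of(addr):
    """Off by one bit: the lowest tag bit is dropped from the compare."""
    return addr >> (INDEX_BITS + LINE_BITS + 1)


a, b = 0x0000, 0x2000  # same index and offset, 8 kB apart
print(tag_of(a) == tag_of(b))              # False: correctly a miss
print(buggy_tag_of(a) == buggy_tag_of(b))  # True: false hit, data wraps
```

With the lowest tag bit left out of the comparison, two addresses exactly 8 kB apart look identical to the cache, which matches the symptom of fetches wrapping around to the original 8k.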
I did not realize that the selector for the LLDT instruction could come from a register. It always loaded from memory, incorrectly when a register was specified. The fix was to skip the memory load state as the selector was already available in a register.
Stats Still about 27k LUTs for the processing core, including I$ and MMU. From the test program the IPC is about 0.084 or a CPI of about 12. Lackluster performance, due to the type of implementation and loads of memory ops taking place.
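The CPI figure is just the reciprocal of the measured IPC, which a one-line helper (hypothetical name `cpi`) makes easy to check:

```python
def cpi(ipc):
    """Clocks per instruction is the reciprocal of instructions per clock."""
    return 1.0 / ipc


print(round(cpi(0.084), 1))  # 11.9, i.e. about 12 clocks per instruction
```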
_________________Robert Finch http://www.finitron.ca
|
Mon Aug 12, 2024 3:30 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Created a .md file with a basic description of the FTA bus. FTA bus is the bus used by my most recent cores including the rf386 and Bigfoot.
Made the FTA response buffer place responses from its inputs onto the output bus in a circular (round-robin) fashion to create an equal-priority system, unless there is a bus slot of higher priority than the chosen one. This prevents a single channel from hogging all the response bandwidth. Previously the buffers were searched in order, which results in fixed priority. Normally devices will use the middle priority (7) of the range 0 to 15. But some devices may specify a higher priority, or a lower one if they do not need rapid responses. The keyboard, for example, can likely wait many clock cycles between responses.
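The arbitration rule above might be modelled as follows. This is a sketch under assumptions, not the FTA implementation: `pick_response` is a hypothetical name, a larger number is taken to mean higher priority, and ties are broken by the circular scan order starting just after the last granted channel.

```python
def pick_response(pending, last_granted):
    """Choose which input channel drives the response bus next.

    pending[ch] is None when channel ch has nothing to send, otherwise
    the priority (0..15) of its waiting response.
    Returns the chosen channel index, or None if nothing is pending.
    """
    n = len(pending)
    best = None
    for step in range(1, n + 1):
        ch = (last_granted + step) % n   # circular scan order
        prio = pending[ch]
        if prio is None:
            continue
        # A strictly higher priority overrides the round-robin choice.
        if best is None or prio > pending[best]:
            best = ch
    return best
```

With every channel at priority 7 the grant simply rotates; a single channel at priority 12 is picked regardless of where the rotation stands.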
The rf386 stats could be improved significantly if there were a data cache. It might cut the data access time for loads in half. This is now being looked into. It would be nice to get the IPC > 0.1.
A first build of the test system was done. I figure I got it far enough along to try running some code in an FPGA. It passed 40 MHz timing (failed 66 MHz). However, the test system did not light the LEDs on the FPGA board, so something is amiss.
_________________Robert Finch http://www.finitron.ca
|
Tue Aug 13, 2024 2:30 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
rf386 Eliminated a state from the RET instruction.
I added a simple 4kB data cache made out of LUTs and discovered there were issues with the operation of string instructions. The data cache is a writeback cache for slightly better performance.
String instructions are currently broken. Either data is not being stored correctly or it is not being loaded correctly. The CPU runs through a number of string operations successfully but then fails on a later one. It is the scan word instruction that fails because the data being compared does not match the ax register. Interestingly, the compare string instruction seems to pass its test. I am investigating the string store operation used to set up the test.
Adding the data cache trimmed about four clocks off the average instruction. IPC is now about 0.098 or 10 CPI with the data cache turned on. The data cache added about 10k LUTs to the core.
_________________Robert Finch http://www.finitron.ca
|
Wed Aug 14, 2024 2:08 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Milestone: Got the IPC up to 0.121, or a CPI of about 8. That is counting each string op repetition as an instruction. This is with the data cache on and working string ops. The bug was that the shift amount for the data extract was being toasted because the least significant address bits were set to zero for the second bus cycle. The low order address bits needed to be retained for the extract.
Load operations were reviewed to ensure that they only specify the bytes needed with the byte lane select signals. It is tempting to just get all the bytes in a 128-bit load, then sort them out later, but the load could be for a memory mapped I/O device. Specifying any extra bytes might wreak havoc on an I/O load operation. Also, the bus bridges expect only 32 bits to be selected when the bridge is accessing a 32-bit device.
Issue: there is no bus lock signal for unaligned memory ops that require two bus cycles. One thought is to have double-width data transfers and allow the memory controller to handle using two bus cycles if needed. In the test system with only a single active bus master it probably is not an issue. A double-width bus is needed to support the CAS instruction.
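The byte-lane selects, the two-cycle split for unaligned accesses, and the extract shift can be sketched together. The 128-bit bus width is from the post; `bus_cycles` and `extract_shift` are hypothetical helpers, and the mask layout (bit i selects byte lane i) is an assumption.

```python
BUS_BYTES = 16  # 128-bit data bus, per the post


def bus_cycles(addr, size):
    """Split an access into (aligned_address, byte_select_mask) pairs.

    Only the bytes actually requested are selected, so a load from a
    memory mapped I/O device never touches neighbouring bytes.
    """
    cycles = []
    end = addr + size
    while addr < end:
        base = addr & ~(BUS_BYTES - 1)         # bus-aligned address
        lane = addr - base                     # first byte lane used
        n = min(end - addr, BUS_BYTES - lane)  # bytes in this cycle
        cycles.append((base, ((1 << n) - 1) << lane))
        addr += n
    return cycles


def extract_shift(addr):
    """Right shift, in bits, that lines up the loaded data.

    Computed from the ORIGINAL address: the low order bits must be kept
    even though the second bus cycle starts at an aligned address.
    """
    return (addr % BUS_BYTES) * 8
```

For a 4-byte load at 0x100E this gives two cycles, `(0x1000, 0xC000)` and `(0x1010, 0x0003)`, with an extract shift of 112 bits. Without a bus lock, another master could slip in between those two cycles.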
Spent today upgrading FPP64, the C preprocessor. I am trying to add assembly language support, including things like repeat blocks. I have it almost working, except that nested repeats are proving challenging.
_________________Robert Finch http://www.finitron.ca
|
Fri Aug 16, 2024 4:22 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Started working on the Arpl compiler for Bigfoot native mode. For a few moments I toyed with the idea of an i386 code generator. It would work using 32 global memory locations as registers. Taking the Qupls code and renaming most of it. But Bigfoot has eight condition code regs, so the compiler is going to be slightly different. I have managed to keep almost the same mnemonics between several different cores. It makes the compiler easier to modify.
_________________Robert Finch http://www.finitron.ca
|
Tue Aug 27, 2024 3:58 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Got some compiler output, and assembled with a partially written assembler. Ended up modifying the Bigfoot architecture somewhat based on not wanting to modify the compiler too much. It is looking like the average instruction length for Bigfoot is somewhere around 28-bits. A lot of instructions are 24-bit. Pulled the string instructions out of the architecture for now. The 3-second delay boot routine (compiled from arpl code and assembled):
Code:
                                 _Delay3s:
00:0000000000000000 0400A0         sub sp,sp,24
00:0000000000000003 2CE003         store fp,[sp]
00:0000000000000006 1C0000         move fp,sp
00:0000000000000009 0400E0FE       sub sp,sp,72
00:000000000000000D 2CE003         store s0,[sp]
00:0000000000000010 2CE003         store s1,8[sp]
                                 ; integer* leds = 0x0FFFFFFFFFEDFFF00;
00:0000000000000013 380000         loadi s1,-18874624
                                 ; for (cnt = 0; cnt < 6000000; cnt++)
00:0000000000000016 1C0000         move s0,r0
                                 .00017:
                                 ; leds[0] = cnt >> 17U;
00:0000000000000019 1800002022     lsr t0,s0,17
00:000000000000001E 10
00:000000000000001F 2C8002         store t0,0[s1]
00:0000000000000022 3D20           addq s0,1
00:0000000000000024 2CC003         store a1,-72[fp]
00:0000000000000027 380000         loadi t1,6000000
00:000000000000002A 1D0000         cmp cr2,s0,t1
00:000000000000002D 1A400600       blt cr2,.00017
                                 .00016:
00:0000000000000031 27E003         load s0,[sp]
00:0000000000000034 27E003         load s1,8[sp]
00:0000000000000037 1C0000         move sp,fp
00:000000000000003A 27E003         load fp,[sp]
00:000000000000003D 012000         retd 32
Still far from perfect, some of the opcodes did not assemble correctly.
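Two constants in the routine can be cross-checked with a little arithmetic. This is only a sanity check on the numbers shown in the listing, assuming 64-bit two's complement registers; the helper names are hypothetical.

```python
LED_ADDR_COMMENT = 0xFFFFFFFFFEDFFF00  # address in the source comment
LOADI_OPERAND = -18874624              # immediate used by the compiler


def to_u64(value):
    """View a signed integer as a 64-bit two's complement bit pattern."""
    return value & 0xFFFFFFFFFFFFFFFF


def led_value(cnt):
    """Value written to the LEDs on iteration cnt (leds[0] = cnt >> 17)."""
    return cnt >> 17


# The negative immediate is the same bit pattern as the LED address.
assert to_u64(LOADI_OPERAND) == LED_ADDR_COMMENT

# Largest value the loop ever writes to the LEDs.
print(led_value(6_000_000 - 1))  # 45
```

So the LEDs count from 0 up to 45 over the course of the delay, changing once every 131072 iterations.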
_________________Robert Finch http://www.finitron.ca
|
Wed Aug 28, 2024 3:37 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
robfinch wrote: Got some compiler output, and assembled with a partially written assembler. snip Still far from perfect, some of the opcodes did not assemble correctly.
Yes, Bad things can happen if the assembler, simulator, hardware are all not in sync or the same version.
|
Thu Aug 29, 2024 7:36 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: Yes, Bad things can happen if the assembler, simulator, hardware are all not in sync or the same version.
Yes, I would agree, I think a way to keep them in sync is to develop all at the same time. Changing an opcode spec has to ripple down to the compiler and assembler. It can maybe be done by one person that way. Otherwise better to have the spec solidified before going further.
Work on the Arpl compiler went faster today than expected. A large part of a port for i386 code was written. The compiler and assembler were also updated to support Bigfoot native mode. Output for the 3-second delay routine, with many corrections, now looks like this:
Code:
                                 #{++ _Delay3s
                                 .sdreg 29
                                 _Delay3s:
00:0000000000000000 04FFA3         sub %sp,%sp,$24
00:0000000000000003 2CFE03         store %fp,[%sp]
00:0000000000000006 1CFE03         move %fp,%sp
00:0000000000000009 44FFE3FE       sub %sp,%sp,$72
00:000000000000000D 2CF303         store %s0,[%sp]
00:0000000000000010 2CF423         store %s1,8[%sp]
                                 # integer* leds = 0x0FFFFFFFFFEDFFF00;
00:0000000000000013 C41400FC7FFB   add %s1,%r0,$-18874624
                                 # for (cnt = 0; cnt < 6000000; cnt++)
00:0000000000000019 1C1300         move %s0,%r0
                                 .00017:
                                 # leds[0] = cnt >> 17U;
00:000000000000001C 16694600       lsr %t0,%s0,$17
00:0000000000000020 2C8902         store %t0,0[%s1]
00:0000000000000023 7D13           addq %s0,$1
00:0000000000000025 2CC2E3FE       store %a1,-72[%fp]
00:0000000000000029 030800366E     cmp %cr2,%s0,$6000000
00:000000000000002E 1A400700       blt %cr2,.00017
                                 .00016:
00:0000000000000032 27F303         load %s0,[%sp]
00:0000000000000035 27F423         load %s1,8[%sp]
00:0000000000000038 1CDF03         move %sp,%fp
00:000000000000003B 27FE03         load %fp,[%sp]
00:000000000000003E C10300         retd $24
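The earlier estimate of about 28 bits per instruction can be checked against this listing. The sketch below copies the encoded bytes of the 19 instructions from the listing and averages their lengths; it is one small routine, so it is only a rough data point, and `average_bits` is a hypothetical helper name.

```python
# Encoded bytes copied from the listing, one hex string per instruction.
ENCODINGS = [
    "04FFA3", "2CFE03", "1CFE03", "44FFE3FE", "2CF303", "2CF423",
    "C41400FC7FFB", "1C1300", "16694600", "2C8902", "7D13", "2CC2E3FE",
    "030800366E", "1A400700", "27F303", "27F423", "1CDF03", "27FE03",
    "C10300",
]


def average_bits(encodings):
    """Average instruction length in bits (two hex digits per byte)."""
    total_bytes = sum(len(e) // 2 for e in encodings)
    return 8 * total_bytes / len(encodings)


print(f"{average_bits(ENCODINGS):.1f} bits")               # 27.4 bits
print(sum(len(e) == 6 for e in ENCODINGS), "are 24-bit")   # 12 are 24-bit
```

The 65 bytes also agree with the listing addresses: the last instruction starts at 0x3E and is three bytes long, ending at 0x41.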
The i386 port makes extensive use of pseudo-registers, _r0 to _r31. It does a lot of move operations to and from memory, and the code is probably triple the size of what the average i386 compiler produces. There is some optimization applied, but the compiler has a ways to go before it generates decent i386 code. The same routine compiled for i386 (with a few errors yet):
Code:
.sdreg 29
_Delay3s:
# integer* leds = 0x0FFFFFFFFFEDFFF00;
  mov $-18874624,%eax
  mov %eax,_r19
# for (cnt = 0; cnt < 6000000; cnt++)
  mov _r0,%eax
  mov %eax,_r18
  mov _r18,%edx
  mov $6000000,%eax
  mov %eax,%ebx
  cmp %ebx,%edx
  jge .00018
.00017:
# leds[0] = cnt >> 17U;
  mov _r18,%eax
  shr $17,%eax
  mov %eax,_r9
  mov _r18,%eax
  add $1,%eax
  mov %eax,_r18
  mov _r18,%edx
  mov $6000000,%eax
  mov %eax,%ebx
  cmp %ebx,%edx
  jl .00017
.00018:
.00016:
  ret
There are some lines of code missing from the output that should be there. I think maybe they are being optimized away and they should not be.
_________________Robert Finch http://www.finitron.ca
|
Fri Aug 30, 2024 2:29 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Figured out why there was missing output for the i386. The indexing operation generator got confused by the fact that pseudo-registers were being used. It could only handle real registers, so it decided not to output anything. Giving some thought now to trying the graph-coloring register allocator with a virtual register set. In theory the register allocator can map virtual registers to real registers, spilling the real registers to and from memory as needed. It would be great if it worked… It has not been properly tested, although it looks like it works when source is compiled with it enabled. Thinking of putting the i386 port on the shelf. There are already compilers for the i386 available.
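The idea of mapping virtual registers onto a small real register set can be sketched with a simple greedy colouring of an interference graph. This is a textbook-style simplification and not the Arpl allocator: the function name, the graph representation and the "most neighbours first" ordering are all assumptions, and a real allocator would also insert the spill loads and stores.

```python
def allocate(interference, num_regs):
    """Greedy graph-colouring register allocation.

    interference maps each virtual register to the set of virtual
    registers that are live at the same time as it.
    Returns (assignment, spilled): a dict of vreg -> real register
    number, and the list of vregs that have to live in memory.
    """
    assignment, spilled = {}, []
    # Colour the most constrained virtual registers first.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for v in order:
        taken = {assignment[n] for n in interference[v] if n in assignment}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            assignment[v] = free[0]
        else:
            spilled.append(v)  # no real register left: spill to memory
    return assignment, spilled


graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(allocate(graph, 3))
print(allocate(graph, 2))
```

With three real registers everything fits; with only two, one of the three mutually interfering virtual registers is spilled. On a register-starved target like the i386 the spill list would be where the pseudo-registers in memory come back in.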
Re-thinking the condition registers. Their intended use was to absorb some of the Boolean values when expressions are evaluated, primarily for branches. However, they do not get much use by the compiler other than for the instruction immediately prior to the branch. When evaluating expressions, it is easier to stick to the GPRs rather than have two types of registers. It is a lot easier to spill GPRs during the evaluation of complex expressions. There is no way to write a condition register directly to memory. It must be loaded into a GPR first. Transferring between GPRs and CRs increases code bloat and kills a lot of the reason to use CRs. Unless a bunch more specialized instructions are added the CRs will have limited use.
_________________Robert Finch http://www.finitron.ca
|
Sat Aug 31, 2024 4:51 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Only two opcodes left at the root level. One of them may be used for wider (40-bit) instructions, which would include things like bitfield operations. I also want to reserve an opcode to select a second page of root opcodes. It may be possible to free up a couple more opcodes. There are really about 256 opcodes in play. Most of the opcodes at the root level have four different lengths associated with them.
Pink = memory type instructions
Green = flow control type instructions
Grey = ALU type instructions
Orange = system control type instructions
Yellow = compare type instructions
Attachment: RootOpcodeMap.png
_________________Robert Finch http://www.finitron.ca
|
Mon Sep 02, 2024 5:47 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Worked on an assortment of things today. Updated the Bigfoot documentation, about 250 pages now. Improved the tracking of temporaries use in the arpl compiler and managed to eliminate some pushes / pops from the output. Found out that register constant propagation was not working due to a register being reassigned a value in a loop. So, I disabled that optimization for now. Needs more work. Did some more study of capabilities. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-987.pdf
_________________Robert Finch http://www.finitron.ca
|
Tue Sep 03, 2024 3:09 am |
|