Author |
Message |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
You may want a few more segments if you have a window display and a heap.
|
Sun Mar 30, 2025 4:43 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
It is a good idea, but I am not sure I want to dedicate address bits to referencing them. PowerPC has 16 segment registers; I'd follow the same pattern. But ATM the bounds registers can be selected by means other than the address. For instance, the code bounds is selected for instruction fetches. The stack bounds can be detected by the use of the stack pointer register in instructions, and the remaining bounds are for data. I switched to having base and bound registers (instead of just bounds), so adding more is tempting. It is also more context to save on context switches and interrupts.
I had been working on context switch code earlier today. I was going to micro-code it, but it is too complex and too many things could go wrong. I ended up adding a limit feature to the BLR (branch-to-link) instruction for tabular jumps. It is somewhat similar to the memory indirect on the My66000. The branch register contains a table offset, and it must be between zero and the limit. Otherwise it branches to the limit address.
I also discovered an issue with jump tables and the way immediate constants are handled. Because large constants are placed at the end of a cache line, a jump table cannot have any large constants in it, or they could end up in the middle of a table.
Sample context switch code:
Code:
#
loadi br6,XJMP            # select context switch function
loadi a0,destination TCB
sys                       # call the system
# we get back here after the context switched

# system dispatcher
# The system routine must end with RFI
blr DispatchTable[br6],DispatchTableLimit

DispatchTable:
.4byte ContextSwitch

# Context switch code
ContextSwitch:
csrrw r0,SCRATCH,a0       # save a0 in scratch register
csrrd a0,TCBA             # get TCB pointer
store a1,8[a0]            # save a1
store a2,12[a0]
store a3,16[a0]
store a4,20[a0]
store a5,24[a0]
store a6,28[a0]
store a7,32[a0]
store t0,36[a0]
store t1,40[a0]
store t2,44[a0]
store t3,48[a0]
...
csrrd a1,SCRATCH          # get value of a0
store a1,4[a0]            # save it
# check if the FP registers should be saved
...
# save stack pointers and branch registers
movea a1,USP
store a1,256[a0]
movea a1,SSP
store a1,260[a0]
movea a1,HSP
store a1,264[a0]
movea a1,MSP
store a1,268[a0]
move a1,BR1
store a1,276[a0]
move a1,BR2
store a1,280[a0]
...
# store the return program counter
...
move a1,CR0
store a1,320[a0]
move a1,CR1
store a1,324[a0]
...
move a1,LC
store a1,352[a0]
move a1,MCLR
store a1,356[a0]
move a1,MCPC
store a1,364[a0]
csrrd a1,SR
store a1,384[a0]
csrrd a1,PBL
store a1,388[a0]
#
# Load the destination context
csrrd a0,SCRATCH          # get destination TCB
csrrw r0,TCBA,a0          # update running TCB address
load a1,388[a0]
move PBL,a1
load a1,384[a0]
move SR,a1
load a1,364[a0]
move MCPC,a1
...
# walk forwards loading registers
load a1,8[a0]
load a2,12[a0]
load a3,16[a0]
load a4,20[a0]
...
rfi
I am trying to use some example code to determine if there is anything amiss. I decided to add an explicit 'loadi' (load immediate) as it can load more bits than an add or an or instruction can. It can also access more registers than just the GPRs. I have the move instruction set up the same way. It can access all the programmable registers (96 of them). A normal instruction can access only 32 regs.
Not keen on adding 12 more segment registers, but maybe four more. They are not accessible to user / app mode. Only OS software needs to know about them.
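The limit check on BLR described above can be modelled in a few lines. This is only a rough sketch, not the actual hardware: the function name, the byte-offset interpretation of the branch register, and treating the limit as an exclusive end address are all assumptions made for illustration.

```python
# Hypothetical software model of the BLR limit check (names are assumptions).
ENTRY_SIZE = 4  # the sample dispatch table uses .4byte entries


def blr_target(table, table_base, offset, limit_addr):
    """Return the address a limited tabular branch would go to.

    table      -- list of handler addresses, one per 4-byte entry
    table_base -- address of the first table entry
    offset     -- byte offset held in the branch register (assumed aligned)
    limit_addr -- address just past the table (assumed exclusive)
    """
    entry_addr = table_base + offset
    if offset < 0 or entry_addr >= limit_addr:
        # Out of range: branch to the limit address instead of the table.
        return limit_addr
    return table[offset // ENTRY_SIZE]


# One-entry table at 0x400, so the limit label sits at 0x404.
handlers = [0x1000]                                # address of ContextSwitch
print(hex(blr_target(handlers, 0x400, 0, 0x404)))  # in range
print(hex(blr_target(handlers, 0x400, 4, 0x404)))  # past the end
```

Placing a catch-all handler at the limit address would let an out-of-range function number fall into an error path without a separate compare and branch.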
_________________Robert Finch http://www.finitron.ca
|
Sun Mar 30, 2025 5:57 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I could never get the 6809 version of Femtiki to work very well. It turns out there was a bug in the timer decrement routine. I found it when porting the code over for Qupls3.
I got a good chunk of the kernel ported over for Q+3 and changed a couple of instructions as a result. The PUSH and POP instructions can now push or pop a list of registers instead of just four max. Up to 17 regs may be pushed or popped with a single instruction. The register list is specified in groups since there are potentially 96 registers that could be stacked. The new version of PUSH and POP can also handle all 96 registers. Before there were separate instructions to push only integer registers or floating point registers.
Mainly worked on the task switch code. It is over 300 LOC in assembler.
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 01, 2025 3:48 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Changes / Additions:
Added the BMAP instruction to the architecture which can map bytes from a source onto a destination. It is very flexible and can permute bytes, reverse order, broadcast, mix and shuffle. It got added because it is used by software in the library functions which I did not want to re-write.
Changed the FTA bus so that there is only a single address in the command request and added a flag as to whether it is virtual or physical. Previously it had separate physical and virtual address fields. This was extra bus baggage as there is only ever one used at a time. Had to go through all the peripherals and change .padr to .adr. Going into the MMU it is virtual; coming out of the MMU it is physical.
Also removed the stb signal from the bus. Stb was a holdover from the WISHBONE bus to control when data was strobed. It did not apply in the FTA bus.
And, removed the asid from the bus. It had no business being there. Asid is just used by the TLB to prevent the need for it to be flushed on a context switch.
In the MMU, the bus retry wait was changed to be a random number of cycles from 1 to 32. Previously it was a fixed wait time of 32 cycles.
Bug fixes: In the MMU the select lines were being set inactive if a bus retry was needed.
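The kind of byte mapping that BMAP performs can be illustrated with a small software model. This is a sketch only: the post does not give the instruction's encoding, so the selector list used here (one source index per destination byte) is a hypothetical stand-in for whatever the real map operand looks like.

```python
# Hypothetical model of a byte-map operation; the selector format is an
# assumption, not the real BMAP encoding.
def bmap(src, selectors):
    """Build a destination where byte i is src[selectors[i]]."""
    return bytes(src[s] for s in selectors)


word = bytes([0x11, 0x22, 0x33, 0x44])

reversed_word = bmap(word, [3, 2, 1, 0])   # reverse byte order
broadcast = bmap(word, [0, 0, 0, 0])       # broadcast byte 0
shuffled = bmap(word, [1, 0, 3, 2])        # swap bytes within each half

print(reversed_word.hex(), broadcast.hex(), shuffled.hex())
```

A single selector-driven primitive covers permute, reverse, and broadcast, which is presumably why library code leans on it.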
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 03, 2025 8:09 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Is the random bus retry a means of improving test coverage, or is it a feature somehow improving performance?
|
Thu Apr 03, 2025 8:28 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote:
Is the random bus retry a means of improving test coverage, or is it a feature somehow improving performance?
It is an attempt to avoid collisions on a shared bus. If the wait times were fixed and there was bus contention, it might get stuck repeating. Of course, if the same LFSR is used in different devices to generate a random wait, the result might be the same anyway. It should also improve performance as the average wait time is reduced.
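To make the LFSR concern concrete, here is a small model of a retry timer. It is a sketch under assumptions: the post does not say which LFSR the MMU uses, so a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) stands in, and the function names are made up. Two devices that start from the same seed produce identical waits and would keep colliding; different seeds diverge. For reference, a uniform draw from 1 to 32 averages 16.5 cycles, against the old fixed 32.

```python
# Hypothetical retry-wait generator; the LFSR choice and names are assumptions.
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one bit."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def retry_waits(seed, count):
    """Return `count` wait times in 1..32 taken from the low five LFSR bits."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero LFSR state never changes")
    waits = []
    for _ in range(count):
        for _ in range(5):          # shift in five fresh bits per wait
            state = lfsr16_step(state)
        waits.append((state & 0x1F) + 1)
    return waits


same_a = retry_waits(0xACE1, 8)
same_b = retry_waits(0xACE1, 8)     # second device, same seed
other = retry_waits(0x1234, 8)      # second device, different seed
print(same_a == same_b, same_a == other)
```

Seeding each device differently, for example from a device ID, would be one cheap way to avoid the lock-step case.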
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 03, 2025 2:31 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Ah, I see. I wonder how a Fibonacci backoff might perform, perhaps with a small random perturbation if that's appropriate.
|
Thu Apr 03, 2025 2:59 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote:
I wonder how a fibonacci backoff might perform
Fibonacci might be a good one to try. I was looking for something really simple that does not use a lot of hardware. I should look at network controllers again. Could also do quadratic backoff.
Changes / Additions
Added a feature to branch-and-link instructions. Normally the destination link register cannot be br7, as that is a read-only reference to the program counter used to generate program counter relative addresses. So, storing a return address there is illegal and would generate an illegal instruction trap. However, this has changed to generate a call to an interrupt subroutine instead. So, it is now possible to easily run hardware interrupt routines from software. This arose from the need to invoke the timer ISR to perform context switching in some cases from OS calls. For instance, sleep() now causes a context switch. It will also be called from StartTask() and ExitTask() when I get around to finishing those routines.
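The backoff schedules mentioned at the start of this post can be compared with a quick software sketch. Everything here is illustrative: the function names, the cap of 32 cycles (chosen to match the earlier fixed retry wait), and the jitter range are assumptions, and nothing below reflects actual bus hardware.

```python
import random


def fibonacci_backoff(attempt, cap=32):
    """Wait for retry number `attempt` (0-based): 1, 1, 2, 3, 5, 8, ..."""
    a, b = 1, 1
    for _ in range(attempt):
        a, b = b, a + b
    return min(a, cap)


def quadratic_backoff(attempt, cap=32):
    """Wait for retry number `attempt` (0-based): 1, 4, 9, 16, ..."""
    return min((attempt + 1) ** 2, cap)


def with_jitter(wait, rng, jitter=3, cap=32):
    """Add a small random perturbation of 0..jitter cycles, still capped."""
    return min(wait + rng.randrange(jitter + 1), cap)


rng = random.Random(1)  # fixed seed only so the sketch is repeatable
for attempt in range(9):
    fib = fibonacci_backoff(attempt)
    print(attempt, fib, quadratic_backoff(attempt), with_jitter(fib, rng))
```

In hardware the Fibonacci schedule needs only two registers and an adder, which fits the goal of keeping the logic small; the quadratic schedule grows faster and hits the cap after five retries.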
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 04, 2025 3:19 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Added code in the assembler to merge common constants together, using the same constant bucket. This reduces code size. It happens surprisingly often.
Changes
Changed the PRED instruction to reuse the conditional branch instructions. Since it is illegal for a conditional branch to store a return address in the PC register, that opcode is repurposed to represent a predicate “branch”. The instruction has the same format as the branch instruction, except that it operates under the opposite condition. The predicate instruction accepts a destination label like a branch, but the destination label represents where the predication stops. If the predicate is false, the CPU will skip over instructions until it hits the destination label. Otherwise, the instructions will be executed. The destination label for a predicate must come after the predicate and be within 13 instructions. The branch displacement field turns into an instruction mask field for a predicate. The reason a mask is used instead of a displacement: a displacement would require an address comparator for every instruction following the predicate in the re-order buffer. A mask is a lot less expensive. It just loads into a shift register and shifts out as instructions are encountered.
Code:
BEQ mylabel # branches to my label if condition is true
PEQ mylabel # executes instructions if the condition is true
The following code shows the use of a predicate; note that there are no conditional branches in the code. Compare results are combined using CR operate instructions to generate a predicate. One might think it may be slow without branches bypassing some of the code; however, the code should only take about 3 clocks to execute. If there were a branch miss it would take about 10 to 13 clocks for each miss. The code with branches could end up taking considerably longer to execute.
Code:
284: FMTK_LockSemaphore:
285: macAdrCheck %a2
01:00000104 031A0000 1M cmpai %cr0,%a2,0 # NULL pointer?
01:00000108 431A00A0 2M cmpai %cr1,%a2,0x00800000 # too low
01:0000010C 831A04A0 3M cmpai %cr2,%a2,0xC0000000 # too high
01:00000110 0B002C02 4M cror %cr0?eq,%cr0?eq,%cr1?lt
01:00000114 0B00500E 5M crorc %cr0?eq,%cr0?eq,%cr2?le
01:00000118 F93E80E0 6M peq %cr0,.00017
01:0000011C 44000200 7M loadi %a0,E_Arg
01:00000120 1B5A0000 8M b OSExit
9M .00017:
The strange question marks in the condition register instructions indicate which bit of the condition register to select. I tried to use a '.', but the assembler insisted on interpreting the result as a number and gave a divide by zero error.
Managed to free up an opcode by moving the instructions into unused portions of other opcodes. There are 12 out of 64 opcodes available.
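The mask-and-shift-register idea behind the predicate can be modelled in software. This is a sketch under assumptions: the post does not give the real mask encoding, so a contiguous run of one-bits (one per instruction before the label) is assumed, and the function names are hypothetical.

```python
# Hypothetical model of predicate masking; the mask layout is an assumption.
MAX_WINDOW = 13  # a predicate may cover at most 13 following instructions


def predicate_mask(window):
    """Assembler side: one mask bit per instruction before the label."""
    if not 0 <= window <= MAX_WINDOW:
        raise ValueError("predicate label out of range")
    return (1 << window) - 1


def run(instructions, mask, condition):
    """CPU side: shift the mask out as instructions are encountered."""
    executed = []
    for ins in instructions:
        predicated = mask & 1
        mask >>= 1
        if predicated and not condition:
            continue                # squashed, behaves like a NOP
        executed.append(ins)
    return executed


# The two instructions between "peq" and the label in the listing above,
# followed by whatever comes after the label.
body = ["loadi %a0,E_Arg", "b OSExit", "next instruction"]
mask = predicate_mask(2)
print(run(body, mask, condition=True))
print(run(body, mask, condition=False))
```

Each instruction only has to look at the bit shifted out for it, so no per-instruction address comparator is needed.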
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 07, 2025 5:19 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
That's quite something - the CPU can execute some 7 or 8 instructions in 3 clocks if predication makes enough of them NOPs.
Thanks for the example - unfortunately I don't quite understand it! I'm having to guess:
3 comparisons set flags
2 predicated instructions do something, or nothing
1 predication marker marks two subsequent instructions as predicated
2 further instructions now do something, or nothing
Is that about right?
|
Mon Apr 07, 2025 7:08 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Faster is good, provided you don't outpace your cache.
|
Mon Apr 07, 2025 8:22 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote:
That's quite something - the CPU can execute some 7 or 8 instructions in 3 clocks if predication makes enough of them NOPs.
Thanks for the example - unfortunately I don't quite understand it! I'm having to guess:
3 comparisons set flags
2 predicated instructions do something, or nothing
1 predication marker marks two subsequent instructions as predicated
2 further instructions now do something, or nothing
Close. The "2 predicated instructions" are not predicated instructions. They are instructions that combine condition register bits logically, so they always do something. They are "cr" operate instructions. The predication marker uses the status of the condition register to predicate the following instructions. The branch label indicates the window of the predicated instructions. The assembler calculates a mask for the instructions instead of a displacement.
I think it may be possible for the CPU to execute all eight instructions in just two clock cycles, though I may be overly optimistic. I would need to measure it, and I do not have a working CPU ATM. The "cr" instructions execute one at a time, but the compares can all execute in parallel.
The '?' in the CR register names is in lieu of a '.'. I could maybe change it to [] if that makes more sense. CR0?EQ means the EQ bit in the CR0 register. So, the individual bits are being OR'd by "cror". I have stared at assembler for so long that I think it is easy to understand.
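For readers who find the listing hard to follow, the flag logic can be restated in a high-level language. This is a reading of the macAdrCheck listing, not the author's code: the function name is made up, and the compares are assumed to be unsigned, which the post does not state.

```python
# Hypothetical restatement of the macAdrCheck flag logic (unsigned assumed).
def address_is_bad(a2):
    cr0_eq = a2 == 0             # cmpai cr0: NULL pointer?
    cr1_lt = a2 < 0x00800000     # cmpai cr1: too low
    cr2_le = a2 <= 0xC0000000    # cmpai cr2: not too high

    cr0_eq = cr0_eq or cr1_lt        # cror:  OR in the "too low" bit
    cr0_eq = cr0_eq or not cr2_le    # crorc: OR in the complement of "le"
    return cr0_eq                    # peq then predicates the error exit


for addr in (0, 0x00400000, 0x00800000, 0xC0000000, 0xC0000001):
    print(hex(addr), address_is_bad(addr))
```

The three compares are independent of each other, which matches the remark that they can execute in parallel while the two CR operations form a short serial chain.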
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 07, 2025 1:09 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Thanks for explaining!
|
Mon Apr 07, 2025 1:15 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote:
Faster is good, providing you don't out pace your cache.
Yes. It'll be fast if everything is loaded in the cache, but otherwise quite slow.
I have not released anything yet. I am beginning to believe things are just vaporware. A lot of learning and experimentation with ideas, to innovate just the right mix.
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 07, 2025 1:18 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Modified the system call instruction to only escalate the call to the next higher operating mode or environment. Previously it would switch to machine / secure operating mode. Renamed “SYS” to “ECALL”. ECALL also now accepts a vector number to call. This is to allow for different environment call dispatchers. There are 4096 vectors allowed. Vectors zero through ten are reserved for the operating system dispatchers. Vector eleven is for built-in ROM routines.
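The vector ranges and the one-step escalation can be summarised in a short model. This is a sketch only: the mode names and their ordering are assumptions (suggested by the USP, SSP, HSP and MSP stack pointers in the earlier context switch code), the behaviour of an ECALL issued from the highest mode is not stated in the post and is assumed here to stay put, and the function names are hypothetical.

```python
# Hypothetical model of ECALL escalation and vector ranges.
MODES = ["user", "supervisor", "hypervisor", "machine"]  # assumed ordering
VECTOR_COUNT = 4096


def ecall_target_mode(current):
    """Escalate one level only; assumed to stay put at the top level."""
    index = MODES.index(current)
    return MODES[min(index + 1, len(MODES) - 1)]


def classify_vector(vector):
    """Say which group an ECALL vector number falls into."""
    if not 0 <= vector < VECTOR_COUNT:
        raise ValueError("vector number out of range")
    if vector <= 10:
        return "OS dispatcher"
    if vector == 11:
        return "ROM routines"
    return "unassigned"


print(ecall_target_mode("user"), classify_vector(3), classify_vector(11))
```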
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 08, 2025 2:13 am |
|