robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
roelh wrote:
> Hi all, my C compiler is now integrated into the Kobold K2 online Javascript assembler/simulator. On a single browser page you can now compile, assemble and simulate the C code, and download a file to program into the Kobold. See it live at http://www.enscope.nl/simas_kobold2/ !

Very impressive - nice work! It looks like you can now cope with the working set not fitting within registers - how did you approach register allocation / spillage?
Wed Mar 11, 2020 10:57 pm

roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
Thank you Robinson! About the register allocation:

The workspace holds a maximum of 16 locals in RAM. At a function call, the workspace pointer WP is incremented by 32, so the called function gets 16 fresh local variables (but for leaf functions the locals are in zero page; this saves the instructions needed to change WP).

There are 8 registers (4 data and 4 address). PC (A0) and WP (A1) are address registers, and one register is needed for the return address, which leaves 5 registers for parameter passing. So the maximum number of parameters is 5. This could be extended by having another parameter-passing convention for parameter 6 and higher; for instance, they could be placed directly into the next workspace.

Calculations are normally done in register D3. This is also the register for function results and for the first parameter of a function call. Within the code generator, there is a function that returns a new register to use (and reserves it), but it is currently only used one level deep. Address calculations are mostly done in D2 and A2.

For expressions, the code generator will first generate code for the deepest branch of the expression. This minimizes the number of intermediate results. But if intermediate results must be stored, this is generally done in the workspace. The code generator does several other optimizations that will be described later.

There are no compare instructions in the Kobold instruction set, so a subtract must be used (that is, an add with the two's complement). This means the register that is compared would be changed by the instruction. But there are also ADD instructions that deliver the result in an address register, like:

Code:
MOVC (var),D3   ; variable to D3 and complement it
ADDI 45,D3,A3   ; add and increment

The comparison result comes in the carry, and this can be used for the compare. The D3 value is not changed, and the result in A3 is simply ignored.

You can have a look at the code generator here: http://www.enscope.nl/simas_kobold2/generator.js
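If it helps, the compare idiom can be modelled in a few lines of C. This is only a sketch of the intended carry semantics (assuming MOVC gives the one's complement and ADDI adds with a carry-in of 1, as the comments above suggest), not a description of the actual hardware:

Code:
#include <stdint.h>
#include <stdio.h>

/* Model of the compare idiom above: the pair computes imm + ~var + 1,
   i.e. imm - var, and the carry out is 1 exactly when imm >= var
   (unsigned). The register holding var is left unchanged. */
static int compare_carry(uint16_t var, uint16_t imm)
{
    uint32_t sum = (uint32_t)imm + (uint16_t)~var + 1u;
    return (int)((sum >> 16) & 1u);
}

int main(void)
{
    printf("%d\n", compare_carry(40, 45)); /* 1: 45 >= 40 */
    printf("%d\n", compare_carry(45, 45)); /* 1: 45 >= 45 */
    printf("%d\n", compare_carry(50, 45)); /* 0: 45 <  50 */
    return 0;
}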
Thu Mar 12, 2020 8:46 am

BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
Nice - can confirm, 3 is prime, and 9 isn't. (You might find that a fast-forward function will help a lot - allow for multiple steps between GUI updates. Ken's calculator simulator runs a few hundred(?) steps at normal speed and then speeds up. Visual 6502 takes a parameter to adjust the fast-forward button's step length.)
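The batching itself is trivial. In C, something like this (step_cpu and redraw_gui are hypothetical stand-ins for the simulator's own step and repaint routines):

Code:
#include <stdio.h>

static long steps_done = 0;

/* Hypothetical stand-ins for the simulator's own routines. */
static void step_cpu(void)   { steps_done++; }
static void redraw_gui(void) { printf("%ld steps simulated\n", steps_done); }

/* Run `total` steps, refreshing the display once per `batch` steps
   instead of after every single step. */
static void fast_forward(long total, long batch)
{
    long done = 0;
    while (done < total) {
        long n = (total - done < batch) ? total - done : batch;
        for (long i = 0; i < n; i++)
            step_cpu();
        done += n;
        redraw_gui();
    }
}

int main(void)
{
    fast_forward(1000, 256);   /* e.g. 256 steps per GUI update */
    return 0;
}

A batch of a few hundred steps per repaint is usually enough to make the GUI cost negligible.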
Thu Mar 12, 2020 9:29 am

roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
Hi all, I did a benchmark test to check my C compiler, in combination with my computer Kobold K2, against a few known systems. You can find my full test description here: https://hackaday.io/project/167605-kobold-k2-risc-ttl-computer/log/180963-how-fast-is-this-thing-anyway

Ran the Sieve of Eratosthenes (tests mainly the speed of loops and of writing to arrays). Result: 3.3 sec.

- 6502: The 6.25 MHz Kobold K2, running the sieve in C, is more than 10 times faster than a 2 MHz 6502 programmed in optimized assembler.
- 68000: The 6.25 MHz Kobold K2, running the sieve in C, beats all C, Ada and Pascal compilers (available in 1983) targeting an 8 MHz 68000!
- 8086: The 6.25 MHz Kobold K2, running the sieve in C, beats all C compilers targeting a 10 MHz 8086!

Ran the recursive Fibonacci algorithm (tests mainly the speed of recursive function calls and returns). Result: 5 sec.

- 8086: The 6.25 MHz Kobold K2, running Fibonacci in C, beats all C compilers targeting a 10 MHz 8086 by a factor of 3 or more!
Sun Jul 19, 2020 5:17 pm

roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
The source code of the C compiler and editor/assembler/simulator can be found here: https://hackaday.io/project/167605/files

And the short description here: https://hackaday.io/project/167605-kobold-k2-risc-ttl-computer/log/182395-c-compiler-assembler-simulator-source-code

Perhaps some of you dare to look at it. I would not be surprised if you see a resemblance to a certain Italian food...
Wed Aug 19, 2020 8:07 pm

drogon
Joined: Sun Oct 14, 2018 5:05 pm Posts: 62
roelh wrote:
> Hi all, I did a benchmark test to check my C compiler, in combination with my computer Kobold K2, against a few known systems. [...] Ran the Sieve of Eratosthenes (tests mainly the speed of loops and of writing to arrays). Result: 3.3 sec. [...] Ran the recursive Fibonacci algorithm (tests mainly the speed of recursive function calls and returns). Result: 5 sec. [...]

Hi,

At the risk of straying OT, but always looking for something to compare my own system to, I gave this a go...

Firstly, my system is a BCPL program compiled into a 32-bit VM/bytecode interpreter running on a 16 MHz, 16-bit CPU (65C816) with an 8-bit data bus, so it's never going to be optimal. However, assuming my interpretation (and translation) of the code is correct and you're doing 10 iterations of the inner loop, I get 13.014 seconds. Not as fast as your C version, but I'm pleasantly surprised it's on par with the AVR at the same clock speed.

Code:
GET "libhdr.h"
GET "sys.h"

MANIFEST {
  size       = 8190
  iterations = 10
}

LET start () = VALOF
{ LET flags = getvec (size + 1)
  LET tStart, tEnd = ?, ?
  LET count = 0

  writes ("Starting*n")
  UNLESS flags THEN
  { writes ("Out of memory*n")
    RESULTIS 1
  }

  tStart := sys (Sys_cputime)
  FOR iter = 1 TO iterations DO
  {
    FOR i = 0 TO size DO
      flags!i := TRUE

    FOR i = 0 TO size DO
    { LET prime, k = ?, ?
      IF flags!i THEN
      { prime := i + i + 3
        k := i + prime
        WHILE k <= size DO
        { flags!k := FALSE
          k := k + prime
        }
        count := count + 1
      }
    }
  }
  tEnd := sys (Sys_cputime)

  writef ("Time: %d ms*n", tEnd - tStart)
  writef (" %d primes found.*n", count)

  freevec (flags)

  RESULTIS 0
}

Code:
Starting
Time: 13014ms
 18990 primes found.

Cheers,

-Gordon
Thu Aug 20, 2020 1:04 pm

oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
My bengol compiler accepts a simple version of C, with some code changes for porting purposes. I get about 10 seconds on my 20-bit, 1.33 µs core-memory computer. 1899 primes. Looking at the BYTE benchmarks, very few tests have the same speed and bus width; a 4 MHz Z80 / IBM PC is about the same speed. Ben.
Thu Aug 20, 2020 3:36 pm

joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
OK, although I'm a bit late to join, I also accepted the challenge and tried it with my CPU74 architecture and compiler.

This is the test C code:

Code:
#include "systemio.h"
#include "system.h"

#define true 1
#define false 0
#define size 8190
#define sizepl 8191

char flags[sizepl];

int main()
{
  int i, prime, k, count, iter;
  myprintstr("10 iterations\n");
  for (iter = 1; iter <= 10; iter++)
  {
    count = 0;
    for (i = 0; i <= size; i++)
      flags[i] = true;
    for (i = 0; i <= size; i++)
    {
      if (flags[i])
      {
        prime = i + i + 3;
        k = i + prime;
        while (k <= size)
        {
          flags[k] = false;
          k += prime;
        }
        count = count + 1;
      }
    }
  }
  putchar('\n');
  printnum(count);
  myprintstr(" primes\n\n");
}

Resulting assembly code: [CPU74 assembly]

Code:
        .globl  main
main:
        add     SP, -8, SP
        st.w    r4, [SP, 6]
        st.w    r5, [SP, 4]
        st.w    r6, [SP, 2]
        st.w    r7, [SP, 0]
        mov     &.L.str, r0
        call    @myprintstr
        mov     1, r6
        mov     8191L, r5
        mov     0, r7
.LBB0_1:
        mov     &flags, r0
        mov     1, r1
        mov     r5, r2
        call    @memset
        mov     3, r1
        mov     3, r2
        mov     1, r4
        mov     0, r0
        mov     0, r3
.LBB0_2:
        and     r4, 255L, r4
        brcc    .LBB0_7
        add     r3, r3, r4
        add     r4, r3, r4
        lea     r4, 3, r4
        cmp.ugt r4, 8190L
        brcc    .LBB0_6
        mov     r2, r4
.LBB0_5:
        st.b    r7, [r4, &flags]
        add     r4, r1, r4
        cmp.ult r4, 8191L
        brcc    .LBB0_5
.LBB0_6:
        lea     r0, 1, r0
.LBB0_7:
        lea     r3, 1, r3
        cmp.eq  r3, 8191L
        brcc    .LBB0_9
        lea     r1, 2, r1
        lea     r2, 3, r2
        ld.sb   [r3, &flags], r4
        jmp     .LBB0_2
.LBB0_9:
        lea     r6, 1, r6
        cmp.eq  r6, 11
        brncc   .LBB0_1
        mov     10, r1
        st.b    r1, [-1L]
        call    @printnum
        mov     &.L.str.1, r0
        call    @myprintstr
        mov     0, r0
        ld.w    [SP, 0], r7
        ld.w    [SP, 2], r6
        ld.w    [SP, 4], r5
        ld.w    [SP, 6], r4
        add     SP, 8, SP
        ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
        .section .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "10 iterations\n"

        .comm   flags,8191,2
.L.str.1:
        .asciz  " primes\n\n"

As you can see, the compiler is clever enough to figure out that the initialisation loop can be replaced by a convenient call to memset. As a matter of info, this is my implementation of memset. It's optimised for the word size of the processor (16 bits), but not even programmed in assembler: [memset function]

Code:
__attribute__((builtin))
void *memset (void *dst, int val, unsigned int len)
{
  if ( !((int)dst & 1) )                /* word-aligned destination */
  {
    int *d = dst;
    int xval = (val<<8) | (val&0xff);   /* replicate the byte into both halves */
    for ( int i=0 ; i<(len&0xfffe) ; i+=2 )
      *d++ = xval;
    if ( len & 1 )
      *(char*)d = (char)val;            /* trailing odd byte */
    return dst;
  }
  char *d = dst;                        /* unaligned: plain byte loop */
  for ( int i=0 ; i<len ; ++i )
    *d++ = (char)val;
  return dst;
}

And now the results on the CPU74 simulator, which, I should point out, is cycle-accurate: [Simulation results]

Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim
10 iterations

01899 primes

Executed instruction count: 2187612
Total cycle count: 2801131
Elapsed simulation time: 3.87819 seconds
Calculated execution time at 1MHz : 2.80113 seconds
Calculated execution time at 8MHz : 0.350141 seconds
Calculated execution time at 16MHz: 0.175071 seconds
Program ended with exit code: 0

Well, I guess this beats them all by an astronomical margin!!
Last edited by joanlluch on Fri Aug 21, 2020 8:22 am, edited 1 time in total.
Fri Aug 21, 2020 6:48 am

joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
As a continuation of my previous post, I also accepted the Fibonacci challenge, and this is what I got for my CPU74 architecture: [Source code]

Code:
int fib(int x)
{
  if (x > 2)
    return fib(x-1) + fib(x-2);
  else
    return 1;
}

Code:
#include "systemio.h"

int main()
{
  int iter, result;
  myprintstr("10 iterations\n");
  for (iter = 1; iter <= 10; iter++) {
    result = fib(24);
  }
  printnum(result);
  putchar('\n');
}

Of course, I had to place the 'main' and 'fib' functions in separate files to prevent the compiler from optimizing away the entire loop. If both functions are placed in the same file, only a single call to 'fib' is performed, regardless of the loop variable. By compiling the functions separately, the compiler does not have a clue about what possible side effects the fib function may have, so it will call it 10 times as per the source code. I had to do that to show a more comparable result.

So this is the compiled code: [CPU74]

Code:
        .globl  fib
fib:
        add     SP, -4, SP
        st.w    r4, [SP, 2]
        st.w    r5, [SP, 0]
        mov     r0, r4
        cmp.lt  r4, 3
        mov     1, r5
        brcc    .LBB0_3
        mov     1, r5
.LBB0_2:
        mov     r4, r0
        sub     r0, 1, r0
        call    @fib
        add     r0, r5, r5
        sub     r4, 2, r4
        cmp.gt  r4, 2
        brcc    .LBB0_2
.LBB0_3:
        mov     r5, r0
        ld.w    [SP, 0], r5
        ld.w    [SP, 2], r4
        add     SP, 4, SP
        ret

Code:
        .globl  main
main:
        add     SP, -2, SP
        st.w    r4, [SP, 0]
        mov     &.L.str, r0
        call    @myprintstr
        mov     10, r4
.LBB0_1:
        mov     24, r0
        call    @fib
        sub     r4, 1, r4
        brncc   .LBB0_1
        call    @printnum
        mov     10, r0
        st.b    r0, [-1L]
        mov     0, r0
        ld.w    [SP, 0], r4
        add     SP, 2, SP
        ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
        .section .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "10 iterations\n"

Now, the results are the following:

Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim
10 iterations
46368

Executed instruction count: 9096884
Total cycle count: 13160704
Elapsed simulation time: 17.5367 seconds
Calculated execution time at 1MHz : 13.1607 seconds
Calculated execution time at 8MHz : 1.64509 seconds
Calculated execution time at 16MHz: 0.822544 seconds
Program ended with exit code: 0

It's still significantly faster than anything else, and by a very long way! According to roelh's posted figures for the 10 MHz 8086, the CPU74 at just 8 MHz is 10 times faster than the 10 MHz 8086. Even at only 1 MHz, the CPU74 is able to beat the 10 MHz 8086 for all the tested compilers. That's a very impressive result, much better than I had ever imagined.
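As an aside, a single-file alternative I considered was to make the result observable through a volatile variable, which forces the ten stores. Note this is only a sketch: a clever compiler could still prove fib pure, compute it once, and store the same value ten times, so compiling the files separately remains the robust approach:

Code:
#include "systemio.h"

int fib(int x);      /* as defined above */

volatile int sink;   /* volatile: every store must actually be performed */

int main()
{
  int iter;
  for (iter = 1; iter <= 10; iter++)
    sink = fib(24);  /* the stores survive optimization, but the call
                        itself may still be hoisted if fib is visibly pure */
  printnum(sink);
  putchar('\n');
}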
Fri Aug 21, 2020 8:21 am

oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
To be fair, I think comparing to a 68000 would be a better choice. The Intel CPUs have to access variables from memory, so they are slower. Ben.
Sat Aug 22, 2020 8:49 pm

drogon
Joined: Sun Oct 14, 2018 5:05 pm Posts: 62
I tried the fib test too, but was too embarrassed to post the results... [1]

The cintcode VM that the BCPL compiler produces is a bytecode for a load/store architecture with a 2-deep register stack, so almost every single operation is from memory to register or register to memory. Running the 32-bit VM on a 16-bit CPU with an 8-bit data bus is somewhat sub-optimal...

-Gordon

[1] OK, here it is:

Code:
Time: 37102ms
Result was 46368

So 37 seconds to do 10 iterations of this calculation. The BCPL code is:

Code:
LET fib (x) = VALOF
{ TEST x > 2
    THEN RESULTIS fib (x - 1) + fib (x - 2)
    ELSE RESULTIS 1
}

And the bytecode (which I commented) looks like:

Code:
// Entry to: fib
20: L1:
20:   L2       ; regB := regA ; regA := 2
21:   JLE L3   ; Jump to L3 if <= 0
23:   LM1      ; regB := regA ; regA := -1
24:   AP3      ; Add local var 3 into A
25:   LF L1    ; B := A ; Load addr of L1 into A
27:   K4       ; Call function, reserve 4 stack locations
28:   SP4      ; Store A into stack loc. 4
29:   LM 2     ; B := A ; Load -2 into A
31:   AP3      ; Add stack 3 into A
32:   LF L1    ; B := A ; Load addr of L1 into A
34:   K5       ; Call function, reserve 5 stack locs.
35:   AP4      ; B := A ; A := stack 4
36:   RTN
37: L3:
37:   L1       ; B := A ; A := 1
38:   RTN

Note that when a function is called, the first argument is in A but also in stack location 3; the 2nd argument is in stack 4, etc. Reg B is popped into A and stack 3 as part of the K call sequence.

I think the compiler is doing a good job of the code generation - the bytecode VM is more or less the "ideal" machine for this - but interpreting it on a 65816 is the crux here. If only I had a 32-bit CPU that could do the B := A operation internally in a single cycle, for example... currently, with 2 x 65816 LDA/STA pairs in 16-bit mode, it takes 20 machine cycles.

(Which is a reason I'm lurking here - hello, and sorry for slightly hijacking the thread. I'm finding I like Retro-BCPL more than I like the 65816, so maybe one day I'll either find a CPU that's a better fit for this bytecode, or adapt some existing core, you never know!)

-G.
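PS: For anyone curious what the interpreter does per opcode, here is a rough C model of the A/B accumulator pair and two of the ops commented above (my simplification, just to show the semantics - the real cintcode interpreter is rather more involved):

Code:
#include <stdint.h>
#include <stdio.h>

/* The 2-deep register stack: A on top, B below it. */
typedef struct { int32_t a, b; } vmregs;

/* "Ln" / "LMn": B := A ; A := n  (LM loads a negated constant) */
static void op_L(vmregs *r, int32_t n)  { r->b = r->a; r->a = n; }

/* "APn": A := A + stack slot n  (AP3 adds local 3 into A) */
static void op_AP(vmregs *r, const int32_t *stack, int n) { r->a += stack[n]; }

int main(void)
{
    int32_t stack[8] = {0};
    vmregs r = {0, 0};

    stack[3] = 24;        /* pretend the argument x = 24 sits in slot 3 */
    op_L(&r, -1);         /* LM1: B := A ; A := -1        */
    op_AP(&r, stack, 3);  /* AP3: A := A + x  ->  x - 1   */
    printf("A = %d\n", (int)r.a);  /* 23, the argument for fib(x-1) */
    return 0;
}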
Sun Aug 23, 2020 9:50 am

joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
oldben wrote:
> To be fair, I think comparing to a 68000 would be a better choice. The Intel CPUs have to access variables from memory, so they are slower. Ben.

I strongly believe that the 68000 processor will rank no better than the 8086 for 16-bit data. The 68000 uses many cycles to do its thing, from 4 to even more than 20 for some instructions. It has the advantage of having more general-purpose registers than the 8086, which may help to reduce some memory-access overhead, but at the end of the day, my guess is that there shouldn't be that much difference (on algorithms performed with 16-bit arithmetic).

I think I have posted this elsewhere, but there are a lot of reasons why the CPU74 is so fast; here's a summary. The CPU74 is designed as a Harvard architecture, a pure load/store architecture, with enough registers to avoid a lot of memory accesses. Instruction fetching and initial decoding of the next instruction happen simultaneously with the execution of the current instruction. Thanks to that:

- Most instructions are executed in 1 cycle.
- Data memory reads and writes use 2 cycles.
- Unconditional jumps and taken conditional jumps use 2 cycles.
- Calls and returns take 3 cycles.

On the other hand, the instruction set is very much optimised for compilers to take full advantage of it, with features such as:

- Conditional moves and selects that help to avoid branching and promote the insertion of speculative code.
- Careful selection of the instructions that depend on status register flags, which gives the compiler freedom to reorder instructions while the status flags might be used one or two instructions later. (This is something that, surprisingly, the original designers of CISC processors failed to understand. The VAX-11, for example, failed completely to get this right; the 68000 instruction set does a much better job in this regard.)

In summary, it's the combination of the Harvard architecture, the LLVM compiler, and the very compiler-friendly instruction set that makes it so fast.

To me, the most interesting benchmark against my CPU74 architecture would be the original ARM1 from 1985, or even the ARM2 processor, at comparable clock frequencies. This is a link to the description of the original architecture: https://en.wikichip.org/wiki/acorn/microarchitectures/arm1. In effect, the CPU74 borrows a lot of the ideas of the ARM1, and they are architecturally very similar, except that the former is a Harvard architecture with simultaneous fetching and execution (exactly 1 cycle per basic instruction), and the latter is a 3-stage pipelined von Neumann architecture (effectively requiring only 1 clock cycle for most sequential instructions).

Among the processors available before or up to 1985, I'm pretty sure that the only one that would beat the CPU74 is the ARM1 (at the same clock frequency). But as noted before, it would be really nice to have actual data with known algorithms to make the comparison. I would really like to have that...

Also, it is interesting to note that the ARM2 processor was released in 1987 with a clock frequency of only 8 MHz, and I believe it must have been a beast compared with anything else of its size available at the time. Still, I suspect that one possible issue for the original ARM1 at the time was the availability of quality compilers that would really know how to squeeze the best out of it.

Joan

[Edit:] I just found this list of instructions-per-second data for different processors: https://en.wikipedia.org/wiki/Instructions_per_second

As a matter of comparison, the Fibonacci test execution for the CPU74 resulted in 9096884 instructions executed in 13160704 cycles, so that's 0.6912 instructions per cycle in that particular case. This means it would give 5.53 MIPS at 8 MHz, or 11.06 MIPS at 16 MHz, versus the average 4 MIPS of the ARM2 at 8 MHz. But of course the ARM is 32 bits, this is just raw instruction-execution data, and for specific algorithm benchmarks the compiler and the instruction set have a strong influence.
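Those MIPS figures are easy to reproduce from the counts above; this trivial program just redoes the arithmetic (nothing here is measured, it only repeats the simulator's published numbers):

Code:
#include <stdio.h>

int main(void)
{
    double insns  = 9096884.0;    /* executed instruction count */
    double cycles = 13160704.0;   /* total cycle count */
    double ipc    = insns / cycles;

    printf("IPC           : %.4f\n", ipc);        /* ~0.6912 */
    printf("MIPS @ 8 MHz  : %.2f\n", ipc * 8.0);  /* ~5.53   */
    printf("MIPS @ 16 MHz : %.2f\n", ipc * 16.0); /* ~11.06  */
    return 0;
}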
Sun Aug 23, 2020 10:35 am

oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
I sped up my CPU by using a faster clock. Looking at my CISC design, I found I could make more decoding don't-cares and simplify to 2 logic levels into the ROM lookup tables. The ROM tables are 20 x 256 bits. I can get a 420 ns ALU cycle using TTL - 74LS and a few 74S parts for a 20-bit ALU. That is the best I can design in TTL, on paper or in FPGA emulation. The next project may be a 24-bit register-file CPU, if I can find dual-port memory. Ben.
Sun Aug 23, 2020 8:46 pm

oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
Time to bump the thread. I have a new ISA for my CPU. It runs at 5.3 MHz (~1976 technology), with a 16-bit bus and 32-bit ints. I get ~15 seconds for the sieve.
Sun May 09, 2021 8:11 pm

BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
Probably the wrong thread - this one is called "Some C compilers"
Sun May 09, 2021 8:54 pm