Last visit was: Thu May 01, 2025 4:26 pm
|
It is currently Thu May 01, 2025 4:26 pm
|
Author |
Message |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
. (well, the air turned blue there for a while as I lost a previous draft of this...) Over on stardot, davidb just found an upcoming presentation for a C compiler project, and that led me to an interesting writeup about another project, and I thought it might be good to have some collected links to C compiler projects, inevitably in various states of progress. First then, Rui Ueyama's interesting story of building a C compiler over 40 days. Even an experienced C programmer might find the language is more complicated than they thought. Second, I should mention two existing threads here: - C64 Compiler by Rob, a more-than-C compiler targeting several cores, including FT64 and OPC6 - Ide68k C compiler an IDE found by Dajgoro And now, here are some links to repositories: - bbc-c "C compiler for the BBC Micro series of micros" mentioned above (written in python) - 8cc, the 40-day compiler mentioned above. - LCC: A Retargetable C Compiler: Design and Implementation- TCC: Tiny C Compiler by Francis Bellard - cc65 of course, for the 6502 - Puppeh's gcc-6502 work in progress (and support tools including a libc) - bcompiler, bootstrapping "a tiny compiler for a toy programming language somewhat reminiscent of C and Forth" - C in four functions as an educational demonstration ( good discussion here.) - legacy cc - "earliest versions of the very first c compiler" - selfie, an "educational software system of a tiny self-compiling C compiler, a tiny self-executing MIPS emulator, and a tiny self-hosting MIPS hypervisor." Once upon a time, there was going to be a huge space-trading-exploration online multiuser game which would involve programming the ship's computer. The architecture was called DCPU-16 and was a bit similar to a 6502. Several people started working on compilers for this architecture - their work might be interesting as starting points. - dcpu16 target for LLVM (my fork of a vanished project) - LLVM backend for dcpu-16 processor - a spiritual successor to the above - DCPUB, "a language similar to B" - TenC, "A high-level, C-like language for the DCPU-16" - py-dcpu-c-compiler C to DCPU-16 compiler written in python. - DCPU-TCC "Super Awesome DCPU C Compiler based on TCC" - d16cc "A C compiler for notch's DCPU16" - 0x10c-Compiler "Effort to create a compiler for 0x10c"
|
Thu Aug 31, 2017 10:34 am |
|
 |
Xark
Joined: Thu Aug 24, 2017 8:52 pm Posts: 4
|
|
Tue Sep 05, 2017 8:41 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Thanks Xark - that looks pretty interesting, as it targets x86, MIPS or trillek RISC, and is self-hosting.
|
Tue Sep 05, 2017 8:43 pm |
|
 |
Xark
Joined: Thu Aug 24, 2017 8:52 pm Posts: 4
|
You are very welcome. I will also mention that a friend of mine Valentin ported this compiler to his custom Sweet32 FPGA CPU (RISC-ish). It worked quite well. Code (for VHDL CPU and compiler port) are at https://github.com/Basman74/Sweet32-CPU-Xark
|
Wed Sep 06, 2017 4:10 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Sweet!
|
Wed Sep 06, 2017 7:52 am |
|
 |
robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
|
vbcc is well worth a look - I've been creating a new backend for it recently, and the interface is really nice. The compiler itself is mature and well-tested, having been widely used on mc68000-based platforms in the 90s. There's even a "generic risc" backend which can be used as a starting point for a new one. http://www.compilers.de/vbcc.html(The license doesn't allow commercial use or distribution of modified versions of the compiler without written permission.)
|
Sat Dec 07, 2019 9:13 pm |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi, for my new Kobold computer I would like to have a c compiler (but that might be a long road). In the past I made a Pascal compiler in assembler for the Z80, and over the last few years I did spend some time making a web-based C compiler, but I did not yet finish it. When modifying an existing C compiler to support a new instruction set, the general approach is to choose a C compiler and change its code generator. The problem is, that after this is done, you are stuck with the C compiler that was chosen. Also, making a code generator might not be easy, depending on the interface between the parser and code generator of the chosen compiler. It might also be that the parser does optimizations that are not effective for your instruction set, or even might make things worse. It would be good to have a standard interface between parser and code generator, especially for hobbyists and single-person companies. I have a proposal for this interface. It is based on JSON syntax, so also well equipped for a web-based Javascript compiler. Here it is: Format for C compiler code generatorThis is a first draft, and all details are not yet in place. I hope that it is clear that after a single JSON-read action, you will have the whole program (or a single function) in memory, as a structured tree (AST). Perhaps a few actions, that were previously done by the parser, must now be done in the code generator. Also, this approach makes the parser totally independent from the target CPU. Please let me know what you think !
Last edited by roelh on Fri Dec 20, 2019 8:33 pm, edited 2 times in total.
|
Thu Dec 12, 2019 3:10 pm |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
I think keep small. Is self compiling on the host machine possible? What about floating point and C libraries?
|
Fri Dec 13, 2019 12:38 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
roelh wrote: Hi, for my new Kobold computer I would like to have a c compiler (but that might be a long road). [...] Also, this approach makes the parser totally independent from the target CPU.
Please let me know what you think ! Hi Roelh, The conversion of C code to JSON and back to a memory graph in the way you propose, certainly would help to avoid the C parser and would enable custom backends to start from that. However, in my opinion it is a relatively small step that still requires a lot of custom work on the target code generator. Although the conversion from plain C to JSON is an interesting exercise, it is only a relatively direct transformation that does not take into account code semantics, and it is highly dependent on the language (C language in this case). In other words, it is not anything else than encapsulating plain C, into a JSON string, and of course the target code generators would still require a JSON parser (although this is quite straightforward) In my opinion, it would be more interesting to produce a canonicalised, medium-level, intermediate representation form (IR) instead, which would be both language independent and target independent. This is the approach taken by modern compilers such as gcc and llvm. The IR form is largely target independent but fully expresses the semantics of the original C code (or any source language) As I worked on llvm, I can post an example from what I learned. Consider the following C code and equivalent llvm IR representation: C source code Code: void testfunction(int x) { if (x > 0) { printfunction(x); } } IR Representation Code: define void @testfunction(i16 %x) { entry: %cmp = icmp sgt i16 %x, 0 br i1 %cmp, label %if.then, label %if.end
if.then: %call = tail call i16 bitcast (i16 (...)* @printfunction to i16 (i16)*)(i16 %x) br label %if.end
if.end: ret void } As it stands, the llvm produced IR code is expressed as a human readable plain text file, which requires a dedicated IR parser on the backend implementation to create a memory graph (DAG) from it. But that's just an example based on what llvm offers: the same IR can be generated in a much more efficient binary form that backends can use in nearly the same way. Nothing prevents you from creating your own JSON encapsulated version of it. The point that I want to show is that in my opinion, starting from IR (or a JSON encapsulated version of it) to produce a custom target compiler, is a better approach than starting from JSON encapsulated C. Once the IR code is available, creating a backend from it (or code generator as to refer to it), is easier because the IR is no longer tied to C source code, it may already incorporate a number of "target independent" optimisations such as at least constant expression folding and dead code removal, which targets do not need to care about. Since the IR already resembles assembly language, the backend implementation is mostly a matter of applying the required transformations and optimisations to it, and convert it into actual target CPU instructions, which require fewer steps than starting from C. [EDIT]: Just to complete this post, I want to post the resulting assembly code from the above IR that my llvm backend implementation for the CPU74 architecture produces: CPU74 Code: .globl testfunction testfunction: cmp.lt r0, 1 brcc .LBB0_2 call @printfunction .LBB0_2: ret In this case, the transformations to the IR just involved the reversal of the condition code to skip the "if.then" code and jump directly to the "if.end" label when the reverse condition is met. Also, since the CPU74 architecture does not directly support "signed less than or equal", the reverse condition is canonicalised to a "signed less than" condition to conform with the available instructions.
|
Fri Dec 20, 2019 8:50 am |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi Joan, thank you for your reaction. But I think I made something not very clear. The proposed JSON format has been processed by the parser. All checks that a C compiler should do are allready done. So the JSON format is (almost ?) on the same level as the IR format that you mention. As I do several things web-based (like the assembler), the code generator would at first probably be made in Javascript, and that ignited the idea to have JSON as intermediate format. I already had a half-working C compiler, strangely enough programmed in PHP, that provided an intermediate format. This week I've been busy translating this intermediate format to the JSON format, and thats 75% done. A javascript code generator will then generate assembler. In the end, I would like to have the compiler running on the homebuilt CPU itself. Both the PHP parser and the Javascript code generator must then be converted to C, and after the first compile this should theoretically be selfhosting. Also another parser can be used (I like the 8cc compiler because it seems quite small and simple) but then there is the question how to interface to the IR code. That's a question that you have solved for the LLVM compiler, but I have the feeling that the LLVM compiler is much too big to run on a small computer. My current PHP translates C as follows (example): Code: int ar[10];
void printfunction (int u, char c) { }
void f(int i){ char t; i = ar[i] + 1; }
void testfunction(int x) { if (x) { printfunction(x, 'a'); } } Code: {"function": "printfunction","par": { "u": "int","c": "char"} ,"return": "void", "mod": [ ],"body": [ { "var": { } } , ] }
{"function": "f","par": { "i": "int"} ,"return": "void", "mod": [ ],"body": [ { "var": { "t": "char"} } , [ "i","=",[ [ "ar",":","i"] ,"+",1] ] ] }
{"function": "testfunction","par": { "x": "int"} ,"return": "void", "mod": [ ],"body": [ { "var": { } } , { "if": "x", "then": [ [ "printfunction",[ "x",97] ] ] } ] } It delivers a list of functions, and also a list of global variables and types (both not shown).
|
Fri Dec 20, 2019 8:19 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Hi Roelh,
Yes LLVM is a monster. Definitely only suitable for cros-compiling purposes as long as small processors are concerned .
Back on subject, I don’t see a marked difference between the C code and your JSON string, at least on your proposed example. I would be interested to see what’s the output if you replace if(x) by if(x!=0) on your example. The C code semantics would be identical, so the output should be identical too, right? however I assume that the output will just include the ‘!=‘ embedded in the JSON string as it is present in the C code (please correct me if I’m wrong). This is not what a canonical IR representation would do.
I’m also unsure if I managed to fully explain the implications of a IR representation. I’m not dismissing your approach, just saying that it appears to be a very different one. My point is that your JSON string is still very high level (and similar to C) as it does not include actual relatively low level instruction representations that would resemble assembly language. Your output only appears to work at the stage of the parser, not the compiler. For example, on all processors the if statement requires a comparison or similar instruction to tell whether a branch should execute or not. Just look at the IR example I posted earlier. It contains such a compare instruction and assembly language like labels to jump to. The following step to transform IR to target assembly code is minimal in this particular example because the IR is already relatively low level. However, such low level representation appears to be missing on your JSON representation, which would require a different kind of work on the target code generator (?). I hope I made myself clearer now.
In any case, please keep the good work, as I’m already learning a lot from your previous builds!
Joan
|
Fri Dec 20, 2019 11:27 pm |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi Joan, It seems the LLVM IR output is some generalized assembly output. This probably assumes a load-store type cpu with a lot of registers, where all operations are between registers. This is not suitable at all for small cpu's like Kobold. Kobold has 6 available registers (PC and WP can not be used freely), only two can address memory, and there are only a few possibilities for register-to-register operations. To do proper code generation for simple processors like Kobold, the code generator module has to have a high-level view. That is what the JSON structure provides. Ok, so I made a simple C compiler (2400 lines PHP) that outputs the JSON intermediate language. And I made a Javascript code generator (600 lines) that produces assembly for the Kobold K2. Both the C compiler and code generator are not yet finished. You can try the compiler-generator combination online at http://www.enscope.nl/compiler/. It contains links to my C language manual, to the JSON language specification, and to the Javascript code generator. There is an integrated editor with syntax highlighting and matching parenthesis indication. The C-compiler starts with a demo program for function call, function pointers, pointers, structures, arrays. I also handles IF statements. You can paste the result in the Kobold assembler http://www.enscope.nl/simas_kobold2/ and have it assembled. It will not run yet, because there is no start-up code and there are no support functions. The code generator can at this moment only handle expressions that do not exceed the available registers. There are also some other limitations that must be solved.
|
Sat Dec 28, 2019 2:25 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Hi Roelh,
That's pretty awesome, particularly because it's all online ! (and must have been a lot of fun, too)
The LLVM IR assumes 'infinite' number of registers, or rather, every instruction line that produces a result is assigned to a new variable (not actually a physical register). It's the responsibility of the backend to use the available physical registers, or to create memory spills if not enough are available, or to implement the memory addressing modes. It's also true that the more available physical registers and the more general purpose they are, the less code needs to be overridden from the base classes. So yes, that means that generating code for small cpus can get harder with LLVM than starting from scratch by parsing a C file and generating code right from it. On the positive side, if the cpu has at least some minimal set of features which LLVM is happy with, it is possible to get high quality, very optimised compiled code.
|
Sat Dec 28, 2019 6:11 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Hi Roelh, just as a matter of fun, I picked the code you posted as the example for your compiler (with a minor syntax difference) and compiled it using my CPU74 LLVM backend, and this is what I got: Code: int sum (int num1, int num2) { return num1+num2; } int (*f2p) (int, int); int n;
void f() { int op1, op2; op2 = sum(op1,n); f2p = sum; //Calling a function using function pointer op1 = (*f2p)(10, 13); }
// demo structure member, array element, and if statement
char ar[8];
struct tt { int r1 ; int r2; };
int i;
void test(){ struct tt* p; if (i==5) { i = p-> r2; } else ar[6] = i; }
CPU74 Code: # --------------------------------------------- # sum # --------------------------------------------- .globl sum sum: add r1, r0, r0 ret
# --------------------------------------------- # f # --------------------------------------------- .globl f f: mov &sum, r0 st.w r0, [&f2p] ret
# --------------------------------------------- # test # --------------------------------------------- .globl test test: ld.w [&i], r0 cmp.eq r0, 5 brcc .LBB2_2 st.b r0, [&ar+6] .LBB2_2: ret
# --------------------------------------------- # Global Data # --------------------------------------------- .comm n,2,2 .comm f2p,2,2 .comm i,2,2 .comm ar,8,2 Well, as you can see, most of the original C code is nowhere in the assembly output because it is never executed or it does not produce an useful result ! So I made some changes to the original code to get some actual code generation actually happening: basically the f() function now returns a value, and the struct pointer is passed as an argument on the test() function: Code: int sum (int num1, int num2) { return num1+num2; } int (*f2p) (int, int); int n;
int f() { int op1, op2; op2 = sum(op1,n); f2p = sum; //Calling a function using function pointer op1 = (*f2p)(10, 13); return op1; }
// demo structure member, array element, and if statement
char ar[8];
struct tt { int r1 ; int r2; };
int i;
void test(struct tt* p){ if (i==5) { i = p-> r2; } else ar[6] = i; } The result is this: CPU74 Code: # --------------------------------------------- # sum # --------------------------------------------- .globl sum sum: add r1, r0, r0 ret
# --------------------------------------------- # f # --------------------------------------------- .globl f f: mov &sum, r0 st.w r0, [&f2p] mov 23, r0 ret
# --------------------------------------------- # test # --------------------------------------------- .globl test test: ld.w [&i], r1 cmp.ne r1, 5 brcc .LBB2_2 ld.w [r0, 2], r0 st.w r0, [&i] ret .LBB2_2: st.b r1, [&ar+6] ret
# --------------------------------------------- # Global Data # --------------------------------------------- .comm n,2,2 .comm f2p,2,2 .comm i,2,2 .comm ar,8,2 Pretty good, isn't' it? (Both the call to the pointer function, and the address of the array element are folded as constant expressions, so there's no call to 'sum', and no indexed memory access to 'ar' (&ar+6 is a physical memory location, not an indexed addressing mode))
|
Sat Dec 28, 2019 6:49 pm |
|
 |
robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
|
roelh wrote: Ok, so I made a simple C compiler (2400 lines PHP) that outputs the JSON intermediate language. And I made a Javascript code generator (600 lines) that produces assembly for the Kobold K2. Both the C compiler and code generator are not yet finished. Wow - that is very cool! Nice work! Quote: The code generator can at this moment only handle expressions that do not exceed the available registers. There are also some other limitations that must be solved. I'll watch with interest as this progresses - I'm especially interested in how much trouble you have dealing with object lifespans and register spilling. Like joanlluch I couldn't resist running your test program through my own pet project's toolchain - in my case it's a vbcc backend for my EightThirtyTwo CPU. VBCC has a function that backend writers can use for debugging, which dumps a human-readable version of its intermediate representation - so first, for comparison, here's what it looks like for your program: Code: sum: allocreg r2 add int M0+-8(FP)(num2),M0+-4(FP)(num1)->M0+r2(0x178ff40)[S] set-return int [tgt-addressing-mode] freereg r2
f: nop push int M0+_n(n) size=4 push int M0+0(FP)(op1) size=4 call function M0+_sum(sum) size=8 => sum move pointer #M0+_sum(sum)->M0+_f2p(f2p) size=4 push int #I13 size=4 push int #I10 size=4 call function M0+_sum(sum) size=8 => sum
test: nop compare int M0+_i(i),#I5 bne L5 allocreg r2 add-int-to-pointer int M0+0(FP)(p),#I4->M0+r2(0x1792110)[S] ptype=pointer move int [tgt-addressing-mode]->M0+_i(i) size=4 freereg r2 bra L6 L5 convert char M0+_i(i)->M6+_ar(ar) from int L6
The code itself probably looks ridiculously verbose - and yes, there's plenty of room for optimisation - but each instruction is only a single byte, so the code footprint's actually not that bad: Code: _sum: stdec r6 mt r2 stdec r6
li IMW0(12) ldidx r6 mr r2
li IMW0(8) ldidx r6 add r2
mt r2 mr r0
ldinc r6 mr r2 ldinc r6 mr r7
_f: stdec r6 stdec r6
ldinc r7 .int _n ldt stdec r6
li IMW0(4) ldidx r6 stdec r6
ldinc r7 .int _sum exg r7
li IMW0(8) add r6
ldinc r7 .int _f2p mr r1
ldinc r7 .int _sum
st r1
li IMW0(13) stdec r6
li IMW0(10) stdec r6
ldinc r7 .int _sum exg r7
li IMW0(8) add r6
ldinc r6 ldinc r6 mr r7
_test: stdec r6 mt r2 stdec r6 stdec r6
ldinc r7 .int _i
ldt mr r1
li IMW0(5) sgn cmp r1
cond NEQ
li IMW1(PCREL(l5)-1) li IMW0(PCREL(l5)-0) add r7
ld r6 mr r2
li IMW0(4) add r2
ldinc r7 .int _i
mr r1
ld r2
st r1
li IMW1(PCREL(l6)-1) li IMW0(PCREL(l6)-0) add r7 l5: #
ldinc r7 .int _ar + 6
mr r1
ldinc r7 .int _i
byt
ldt
stbinc r1
l6: # ldinc r6 ldinc r6 mr r2 ldinc r6 mr r7
|
Sat Dec 28, 2019 8:00 pm |
|
Who is online |
Users browsing this forum: claudebot and 0 guests |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot post attachments in this forum
|
|