Thursday, 3 December 2015

Self hosted C - breakdown

I did it. It wasn't easy, but I did it. My C compiler can compile itself. Even though it still has holes in functionality and obvious bugs, it gives me a funny sense of pride that my compiler can now be used to improve itself. I consider it a significant milestone, and this post shares an overview of what was involved.

Let's look at the breakdown...

The self hosting commit:

The final commit needed for self hosting added support for C va_lists, just enough to compile the code used in the compiler itself.
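For readers who have not used va_lists, here is a minimal sketch of the kind of code the feature enables. It is an illustration only and is not taken from the compiler's source; the function name sum_ints is made up for this example.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: add up `count` int arguments passed after `count`. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);             /* start reading after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);    /* fetch the next argument as an int */
    va_end(ap);
    return total;
}
```

Calling sum_ints(3, 1, 2, 3) walks the three trailing arguments in order. A compiler has to implement va_start, va_arg and va_end itself, because they depend on how the target passes arguments.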

Timeline:





~ 188 days of self directed work.


Commits: 

Approximately 380 commits, though probably more, because I discarded lots of work in local branches. I estimate a commit to be 20-30 minutes of work on average. At roughly 25 minutes each, 380 commits come to about 160 hours, which translates to about 4 work weeks of actual coding.

Punch card:

 





I worked whenever I could take a break, including evenings and lunch times. Sometimes I couldn't sleep and would still be hacking away at 3 am. Once I woke up at 6 am, probably to fix a bug that was giving me nightmares. 

Lines of code:

5902 lines of code currently, but this will grow. My original goal was a complete toolchain in less than 15k lines of code, and I think I still have room to spare for an optimizing backend and assemblers.

Code quality/clarity:

I think I have done a good job, though there is always room for improvement. I really wanted to make something anybody could understand. It is a matter of opinion, but look for yourself.

Compare my for loop parser:
https://github.com/andrewchambers/c/blob/7775638eeb241979d2568ec699911bc797f7bb6e/src/cc/parse.c#L1279

To the equivalent clang for loop parser:
https://github.com/llvm-mirror/clang/blob/08e3bfe1f5d00ebe115c2f2e44a93e396d59177e/lib/Parse/ParseStmt.cpp#L1474

To the equivalent gcc parser:
https://github.com/gcc-mirror/gcc/blob/e01e62c7a9ae012337243c86e1e1a2e0041f9895/gcc/c/c-parser.c#L5596

To the equivalent Tiny C compiler parser:
https://github.com/andrewchambers/tccmirror/blob/d6d7686b608c4b7cd88877b30579ca2346e5d284/tccgen.c#L4526

Motivation levels:

I am unfortunately the type of person who constantly starts new projects and stops before they hit a major milestone. I eventually reached a point where I felt like a failure, and that I couldn't finish anything. Overcoming this can be a struggle, but in this case I challenged myself to not be a quitter. Whenever I hit a brick wall and wanted to give up, I told myself that this barrier would stop someone else, but it won't stop me.

My motivation levels did drop at times, but I picked myself back up every time to reach this milestone.

The future:

My crazy ambition was to write the cleanest C compiler that could be used for an all C operating system like Plan 9 or OpenBSD, and there is still lots of work to do to reach that level of sophistication. Real OS support will require funding or many more dedicated code contributors.

Conclusions:

This milestone really made me reflect on and appreciate what some of the early programming language pioneers went through. The first self hosted languages really are something special.

Let me know if you enjoyed this work and want it to continue.

Thursday, 29 October 2015

A hidden gem

I thought I would post about a wonderful C compiler which someone showed me recently, and which I think needs some more attention - neatcc.

This compiler is really a great demonstration of software engineering and I can't credit the author enough. It comes with its own tiny neatlibc, its own static linker neatld and its own assembler.

Wow!

Sunday, 20 September 2015

Compiler Warnings Considered Harmful

In this post I would like to make the argument that compiler warnings are bad. This may seem like a crazy thing to say, but bear with me while I make the case.

There are two times when a compiler is being used: the first is when the developer is writing the code, the second is when an end user or packager is building the package from source. The second use case is actually far more common for popular software, but is seriously underserved by compiler writers. For packagers, compiler warnings do a few things:
  • Break the build when -Werror is enabled and different compilers add new diagnostics.
  • Train people to ignore messages from the compilers.
  • Waste CPU time by checking things the person has no ability to fix anyway.
  • Look ugly.
For the programmer, warnings are actually useful, but misplaced. The compiler's job is to turn code into assembly/bytecode/whatever, and nothing else. The compiler should only stop on an invalid program. We actually already have tools designed to warn the programmer of bugs, and those are called linters and static analysers.

My proposal is simple. Shift all warnings out of compilers and into code analysis tools that are as easy to run as the compiler itself. That way programmers get good warnings and our compilers can be faster and less annoying for everyone else. The Go programming language designers already realized this, with a compiler that emits no warnings, and excellent tools for catching bugs (https://golang.org/cmd/vet/, https://github.com/golang/lint).

Whatever happened to "do one thing, and one thing well"? Compiler warnings are bad design.

Saturday, 12 September 2015

Calling conventions are hard - Fuzz them!

I am busy implementing the C AMD64 calling conventions in my C compiler suite and have a topic worthy of a post. It is about testing the C ABI (how C programs lay out structs and perform function calls).

The old Linux C x86 ABI was relatively simple: to call a function you pushed the arguments onto the stack in reverse order and were done with it. Unfortunately for me, most people now use AMD64 processors, so that is what I need to target first. The AMD64 ABI designers apparently didn't like simple or well specified things (presumably because it would make software engineering too easy), so they created this document to describe the way C structs/arguments are laid out in memory and registers among other things.

I have a few problems with the document, such as a lack of examples, lack of pseudo code for the classification algorithm, and underspecified edge cases. However, regardless of whether my complaints are valid or not, I still need to implement the thing correctly before my compiler can self host. I need a good way to test my implementation...
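To give a feel for what the classification involves, here is a heavily simplified sketch. It is my own illustration rather than the ABI's algorithm: it only handles structs made of naturally aligned scalar fields, and it ignores long double, vector types, bitfields, nested aggregates and what happens when registers run out. All the names (arg_class, field, classify_struct) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified argument classes; the real ABI defines several more. */
enum arg_class { CLASS_NONE, CLASS_INTEGER, CLASS_SSE };

/* Hypothetical description of one scalar struct field. */
struct field {
    size_t offset;   /* byte offset of the field within the struct */
    int    is_float; /* nonzero for float/double, zero for integers and pointers */
};

/*
 * Classify a struct built only from naturally aligned scalar fields.
 * Returns 1 and fills in the class of each eightbyte if the struct is
 * a candidate for registers, or returns 0 if it is passed in memory.
 */
static int classify_struct(const struct field *fields, size_t nfields,
                           size_t struct_size, enum arg_class out[2])
{
    out[0] = out[1] = CLASS_NONE;

    /* Anything larger than two eightbytes goes in memory (no vector types here). */
    if (struct_size > 16)
        return 0;

    for (size_t i = 0; i < nfields; i++) {
        size_t eightbyte = fields[i].offset / 8;
        enum arg_class c = fields[i].is_float ? CLASS_SSE : CLASS_INTEGER;

        assert(eightbyte < 2);
        /* Merge rule: INTEGER beats SSE when both share an eightbyte. */
        if (out[eightbyte] != CLASS_INTEGER)
            out[eightbyte] = c;
    }
    return 1;
}
```

Under these assumptions a struct of two doubles comes out as two SSE eightbytes, struct { int a; float b; } is a single INTEGER eightbyte, and a struct of three longs is passed in memory. The edge cases this sketch skips are exactly the ones that make the real thing hard to get right.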

Enter ABIFUZZ

We have a few C compilers like gcc and clang we can test against, but hand writing interesting test cases is a chore, so I decided to automate it. The general steps are quite simple:

  • Decide how many arguments you want.
  • Decide the types of those arguments.
  • Generate values for the arguments.
  • Decide the return type.
  • Generate a return value.
  • Generate code to do the call and check the values.
The tool is located here and took an afternoon to write. Here's the end result:




The final step is to write a script to split the caller and callee into two files to test interop when each is compiled by a different C compiler.
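As a rough idea of the shape of such a test, here is a hand-written sketch of a caller/callee pair. It is not actual output from the tool; the struct types, values and function names are invented for illustration, and a real run would randomise them.

```c
#include <assert.h>

/* Hypothetical generated types, chosen to exercise different AMD64 cases. */
struct small { int a; float b; };    /* 8 bytes, fits in a single eightbyte */
struct mixed { long a; double b; };  /* 16 bytes, one integer and one floating point half */
struct big   { long a, b, c; };      /* 24 bytes, too large for registers */

/* Callee: verify every argument arrived intact, then return a known value. */
struct mixed callee(struct small p0, struct big p1, double p2, struct mixed p3)
{
    struct mixed r = { 42, 2.5 };

    assert(p0.a == 7 && p0.b == 1.5f);
    assert(p1.a == 1 && p1.b == 2 && p1.c == 3);
    assert(p2 == 0.25);
    assert(p3.a == -9 && p3.b == 8.0);
    return r;
}

/* Caller: pass the agreed values and verify the value that comes back. */
void run_case(void)
{
    struct small a0 = { 7, 1.5f };
    struct big   a1 = { 1, 2, 3 };
    struct mixed a3 = { -9, 8.0 };
    struct mixed r  = callee(a0, a1, 0.25, a3);

    assert(r.a == 42 && r.b == 2.5);
}
```

With callee in one file and run_case in another, each half can be built by a different compiler and linked together. Any disagreement about where an argument lives then shows up as a failed assertion.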

Bugs found:

http://savannah.nongnu.org/bugs/index.php?45950
https://github.com/andrewchambers/c/issues/14
https://github.com/andrewchambers/c/issues/13




Monday, 7 September 2015

A Smaller, Better compiler suite.

You should be able to get a C compiler, assembler, linker and libc for any supported target in less than 30 seconds just by typing make... Or at least that's my plan.

I have started work on a BSD licensed simple but powerful C compiler suite here https://github.com/andrewchambers/c (A C port/continuation of my now frozen Go based C compiler). After a few months of work in my free time the compiler is building some non trivial test cases on amd64 Ubuntu, but no real software.

I encourage you to clone it and have a play around.

Some general goals I have in mind are:
  • Compile times that are 2 - 5 times faster than gcc or clang. TCC is 10 times faster, but does not have text assembly or an AST.
  • Be one to two orders of magnitude smaller than gcc and clang/llvm. For every million lines of gcc code, we could have ten thousand lines of code.
  • Emit assembly that has performance at least equal to tcc. This is a modest performance goal so we don't focus prematurely on this over compatibility.
  • Have the whole system build from source in less than 30 seconds (probably much less) on a modest desktop machine or even low end ARM systems.
  • Be zero config compatible with the excellent Musl libc on Linux.

To answer why I would start a new compiler suite from scratch, perhaps the following will resonate with you.


GCC and Clang:


GCC is large and complicated and non standard. Generally porting it is difficult and out of reach of hobbyists. Building these compilers from source requires 20 minutes to many hours. LLVM and Clang suffer from the same issues and they have added CMake to the list of things I can't get behind.

For most of my use cases I question the need for hundreds of thousands of lines of optimizer code. I think the Google Go toolchain + stdlib's 30 second build proves this nicely. I would prefer a simple C compiler written in C, to a complicated C++ compiler written in C++ supporting all of C++ with C on the side.

Bootstrapping these cross compilers with working libc's is so complicated/arcane there are dedicated tools like buildroot and crosstool-ng just to manage the complexity.

Both these compilers also seem to require more RAM and CPU to self host than modest hardware or emulators like qemu can provide. This is actually a serious barrier to overcome when trying to work with many platforms.

TCC:


TCC is extremely fast and small. I generally use tcc as my primary C compiler when I don't want to deal with GCC. I have two issues with this compiler.

I don't think I am alone in saying the code style is terse and hard to understand. Perhaps it was written with speed alone in mind, perhaps the lack of an AST has allowed some ugly hacks into the code base, or perhaps my taste is just different. I would encourage you to make this judgement call for yourself by comparing code.

The major limitation however, is that because TCC emits binary directly with no text assembly, it is much harder to use with some hobby systems which have existing assemblers. This was the main deal breaker for me.

PCC and 8CC:


PCC is old and mature, generates good code, and can build real programs. 8cc is simple and self hosting with a small and nice code base.

These are the best candidates so far to meet my goals. All I can really say is I think we can take the best ideas from these projects, and have no problem sharing code/design in order to create the best system possible.