Compiling a language to C#

Several people have written me to say that they're writing their own language (let's say 'X') and they compile their language to C# and then compile C# to IL. This is instead of directly compiling X to IL. This can be attractive because:
1) Using C# constructs may simplify the code-gen for X. For example, it's easier to emit an 'if (..) { ... } else { ... }' than the raw IL instructions.
2) It may simplify semantic analysis by letting X piggyback on top of C#. X could generate C# code like "System.Console.WriteLine(o);" without necessarily determining the type of 'o'. I personally think this is cheating, but it can be a nice shortcut.

That's great, but it introduces some issues for debugging, since you want to debug at the 'X' source level and not the C# source level. Assuming language X is sane, you should not have to write your own debugger. If you get the PDB right, you should be able to leverage the full power of an existing debugger (such as Visual Studio) to get a reasonable debugging experience.

Here are some things to pay attention to:

1) Use #line for source-line mapping: The biggest thing is that you need to get the IL-to-source map (sequence points) correct. When you compile X --> C# --> IL, the C# compiler by default will emit sequence points that map the IL back to the C#. You can use C#'s #line directive to provide your own mapping, which lets you map the IL back to 'X' (or any other source file). This is also great for code generators. Note that each #line can specify its own source line and file, and thus a single function can be mapped back to source lines from multiple files.

This alone will solve many problems, including making managed breakpoints, stepping, and set-next-statement work.
These sequence points are stored in the PDB. I have a tool that converts a managed PDB into an XML file.
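
As a rough sketch of what the generated C# might look like, here is a small program whose statements are mapped back to X source. The file names "Program.x" and "Helpers.x" and the line numbers are hypothetical; a real code generator would fill them in from X's own source positions.

```csharp
// Generated from Program.x (a hypothetical X source file).
using System;

static class Program
{
    static void Main()
    {
#line 3 "Program.x"
        int total = 0;
#line 4 "Program.x"
        for (int i = 1; i <= 3; i++)
        {
#line 5 "Program.x"
            total += i;
        }
#line hidden
        // Glue code with no counterpart in X; the debugger steps over it.
        string text = total.ToString();
#line 6 "Helpers.x"
        // A statement attributed to a second (hypothetical) X file.
        Console.WriteLine(text);
#line default
    }
}
```

With a mapping like this, a breakpoint set on line 5 of Program.x should bind to the 'total += i;' statement, and stepping should move through the X lines rather than the generated C#. '#line hidden' marks code the user never wrote, and '#line default' restores the normal C# line numbering.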

 

2) What about locals? Local variable names are stored in the PDB. C# doesn't provide a way to override the names of locals. Thus you'll need to pick your local variables in C# such that they map well to any local variables in 'X'. This may require clever codegen. Consider the following:

    2a) One problem is that sometimes a local in 'X' may be a reserved keyword in C#. You can get around this by using C#'s '@' prefix. E.g., you can say this in C#:
            int @int = 5;   // declares a local var named "int"
    2b) Adopt a coding convention that decorates all 'internal' locals, to make it clear to the end user that they don't map to locals from 'X'. (Note that double-underscore is reserved.) The C# compiler uses variables with names like 'CS$1$0000' for this purpose.
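
Both points can be sketched together. In the example below, the locals "int" and "class" stand for names that came from X, and the "x_tmp_" prefix is a made-up convention for temporaries that the X compiler introduced on its own; none of these names come from a real compiler.

```csharp
using System;

static class Program
{
    static void Main()
    {
        // Locals that X named "int" and "class". The '@' prefix lets C#
        // accept the keywords as identifiers; the '@' is not part of the name.
        int @int = 5;
        string @class = "demo";

        // A temporary introduced by the code generator. The "x_tmp_" prefix
        // (hypothetical) tells the user this local does not exist in X.
        int x_tmp_0 = @int * 2;

        Console.WriteLine(@class + ": " + x_tmp_0);
    }
}
```

In the debugger's locals window, the first two should show up as "int" and "class", matching what the user wrote in X, while the decorated temporary is easy to recognize as compiler plumbing.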

3) What about callstacks? As I mention here, #line will affect the source-to-IL maps, but it won't affect the callstacks. That's because the callstack is based on metadata and not symbols (that's why the StackTrace class can work even without PDBs). Thus to have a reasonable callstack in the debugger for 'X', the X --> C# mapping must be intelligent. You can use '@' on function names too. If you're generating C# code for a function Foo() in X, try to generate it as a single C# function also called Foo(). If you need to generate multiple C# functions, consider calling them Foo_1() and Foo_2().
Technically, since the source mapping is arbitrary (as defined by the PDB), the function name in the callstack and the source mapping don't have to match up. But such a mismatch is bound to confuse end users!
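
To see why the generated method names matter, here is a minimal sketch in which a single X function Foo() has been split into two C# methods. The split itself is an assumption for illustration; the program just prints the method names that the StackTrace class reports.

```csharp
using System;
using System.Diagnostics;

static class Program
{
    // X's function "Foo", emitted under the same name.
    static void Foo()
    {
        Foo_1();
    }

    // A second C# method that the (hypothetical) X compiler needed for Foo.
    // The "Foo_1" name keeps it recognizable in the callstack.
    static void Foo_1()
    {
        // Method names come from metadata, so this works without a PDB
        // and is unaffected by any #line directives.
        StackTrace trace = new StackTrace();
        for (int i = 0; i < trace.FrameCount; i++)
        {
            Console.WriteLine(trace.GetFrame(i)?.GetMethod()?.Name);
        }
    }

    static void Main()
    {
        Foo();
    }
}
```

In an unoptimized build the frames should be listed innermost first, starting with Foo_1 and then Foo; an optimized build may inline or otherwise elide frames. Had the helper been emitted under an unrelated name, that unrelated name is what the user would see in the debugger's callstack.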

4) In general, use debugger-friendly code-gen.  Look at the code-gen for anonymous delegates for an example of how debugger-friendly code-gen can make a language construct more debuggable. We didn't provide any new debugging support for anonymous delegates, yet friendly codegen means end users still have a good experience.  Note that most compilers have a /debug switch for generating explicitly debuggable code. 
Sometimes this is as easy as picking good names. Sometimes it may be more complicated. Look at C# yield as a moderate example.
 

These issues are related to issues I raised when I explained how to add debugging support for an arbitrary state machine.
If I can think of more techniques, I'll try to come back here and update this list.

Comments

  • Anonymous
    October 01, 2005
    Hello,

    If it is within your knowledge, I'd like to know how I may write and compile my own language with C#. I know you can do such with Assembly, but I'm in no way good with Assembly language.

    Thanks,

    badguy219@NOSPAM_yahoo.com
  • Anonymous
    October 01, 2005
    bg219 - Can you clarify what you're asking?
    Eg: Are you asking:
    1) how do you write a compiler in C#?
    2) how do you write a compiler which translates a language into C#
  • Anonymous
    October 02, 2005
    I mean I would like to know how to write my own language in C#. Even something really simple would be fine. Any ideas on making such a language in C#?
  • Anonymous
    October 02, 2005
    You're in luck. C# and the CLR are great for writing languages. Some links:
    1) I wrote a C# compiler in C#. Full source is available. See here for details: http://blogs.msdn.com/jmstall/archive/2005/02/06/368192.aspx

    2) IronPython is a Python compiler / interpreter written in C#. Full source is also available. See http://www.gotdotnet.com/workspaces/workspace.aspx?id=ad7acff7-ab1e-4bcb-99c0-57ac5a3a9742.

    3) Reflection.Emit is a set of class libraries that you can access from C# to generate IL. See http://blogs.msdn.com/jmstall/archive/2005/02/03/366429.aspx
  • Anonymous
    October 02, 2005
    Yes, thank you very much Jmstall. However, this compiler.... Will it compile a language made in C# as well? Also, do you have a tutorial or article about making a language in C#? Some help would be appreciated.


    Thanks!
  • Anonymous
    October 03, 2005
    bg219 - I'm confused again as to what you're asking.

    Compilers have 3 qualities:
    1) The target input language. What language do they actually compile? E.g., csc.exe (Microsoft's C# compiler) takes in C#. cl.exe (MS's C++ compiler) takes in C++.

    2) The target output. What does it compile to? Most compilers will target some stand-alone executable. csc.exe produces a .NET exe (an exe containing IL that runs on the CLR). cl.exe produces a win32 exe (runs without the CLR).
    This blog post points out that a compiler could actually produce C# (which may be easier to produce than targeting an .exe directly), and then use csc.exe to convert that to an .exe.

    3) What language is the compiler itself implemented in? This is independent of the answers to #1 and #2! csc.exe happens to compile (C# --> .NET exe), and is implemented in C++. Blue (my compiler above) compiles (C# --> .NET exe) but is implemented in C#.
    You could write a compiler in ML that compiles (C# --> .Net exe)


    Given this background, I'm not sure how to interpret your question.
  • Anonymous
    October 03, 2005
    Sorry for the delay, I have to thank you for this article, it really helps.
  • Anonymous
    October 04, 2005
    Roman - great! I'll update it as people raise new issues.
  • Anonymous
    February 27, 2008
    We need some customer feedback to determine if we fix a regression that was added in VS2008. Any language