Writing a simple assembler

Hi,

(since asm conf. is dead, i post it here, maybe someone can help)

In my current project , i have to create i simple assembler for an 8-bit processor. Since i am a novice at language writing, I would like to ask if somebody can direct to some articles on this topic, or where to start looking at. Thank for help.

--
Alex
Reply to
Alex
Loading thread data ...

Alfred Arnold's AS is here

formatting link

This covers many microcontrollers.

and more complex, is Randall Hyde's HLA

formatting link

latticeSemi's downloads for the small Mico8 SoftCPU include an assembler source, but it is simpler than AS, which now supports Mico8.

-jg

Reply to
Jim Granville

Hey you have to get this book:

Language Translators: Assemblers, Compilers, and Interpreters by John Zarrella ISBN: 0935230068

used on amazon is around ~$4 and change.

I bought it to give me a heads up for a project I was working on at home ... I needed to write an assembler for a 8bit micro as well and needed to brush up on the different modules comprising such a task.

Its rich in theory and practical!!! (versus the "dragon book" which is all theory and no practical examples )

Alex wrote:

Reply to
samIam

My favoured tools are lex/yacc (flex/bison). Source code to my assembler for simple architectures is here (non-macro though):

formatting link
Notes here
formatting link

Reply to
toby

I am writing an assembler for a 16-bit soft core CPU I made for FPGA in VHDL.

I have a general idea of how I am going to attack this (I dont have any theoretical books).

I will outline my plans to you a bit later. Hopefully, people will offer some critique and we can learn together :).

Reply to
Isaac Bosompem

OK here is my general idea:

What I have decided to do was to split the assembler into classes. I figured this would be a good time to do something useful and advance my C++ knowledge and methodologies.

  1. Preprocessor Basically takes equates and macros and replaces them with their equivalent. This alone will require a pass across the entire file.

  1. File Tokenizer Once the preprocessor has finished, the assembly will start to take place. This respective class will take the file and divide it into tokens (with selected delimiters). Its sole purpose is to maintain the position in the file, memory associated with it etc.

  2. Global Hash Table

This portion was added to make my life easier. I decided that instead of parsing the data through text, which will make my job a living hell, I decided to use a global hash table to keep track of strings. This way I can just do a table lookup of the hashed token, look up type flags and determine what to do from there (assemble an opcode, report error, incompatible types, etc.)

  1. Opcode hash list

This portion really is the portion of the program which contains hash values for each of the opcode words (minus size suffixes), of course everything will be put in uppercase before its hash value is taken.

  1. Opcode assembler.

Takes an opcode from the hash list, parses parameters and assembles the respective opcode, based on the size of the variable (if operated), the register operands, etc.

This is my general idea, of course I havent started to code it yet as I am still planning it.

I do need an efficient hashing function, with a low probability of collisions (ideally). I will need to code to handle collisions.

Anyways I am open to new ideas hopefully you have some of your own to add.

-Isaac

-------------------------------------------------------------------------------------------------------------- I am an EE student looking for summer employment in Toronto, Canada area If you have any openings please contact me at isaacb[AT]rogers[DOT]com.

Reply to
Isaac Bosompem

One method I have used is to write out the syntax of your assembly language in BNF, then augment the BNF by appending to each rule the name of a function to be called when that rule is matched (ie fires). Let's call this augmented syntax BNF+.

Once you have the core BNF engine running, it can build the assembler for you (usually called a compiler-compiler, but this is assembler). You write the syntax of BNF itself, in BNF+. Feed this as input to the engine, and get out a set of control tables. Drive the engine with these, and it will read anything in BNF+ (eg your assembly language).

Besides the BNF engine, you need a symbol table (indexing method of your choice), and the back-end code generation functions.

You can do this without a tokeniser, by defining a token on BNF, & letting the main engine do the work. This is simple, but slow.

The following examples are not fully functional, but should give the flavour of the thing:

Fragment of a BNF+ syntax file:

  • rule elements semantic function file = {oneline} [eof] . oneline = line ending . NextLine
*-----------------------------------------------
  • Basic elements eof = #26 . eol = {ws} [comment] #10 . ws = wsc {wsc} . wsc = " " | #9 . digit = ("0".."9") . uphex = ("A".."F") . lowhex = ("a".."f") . upper = ("A".."Z") . lower = ("a".."z") . hexchar = digit | HexValueDigit uphex | HexValueUpper lowhex . HexValueLower letter = upper | lower . symch = letter | digit | "_" .

Top-level function of the BNF+ engine:

// Syntax "engine": the core of the system #define SP current.syntab[current.synptr] #define GETCHR {currch = getchr(&current);} #define NEXTCH {lastch = currch; current.fileptr++; \ GETCHR; if(currch=='\n') current.linum++;} #define GOTOCH(x) {current.fileptr = x; GETCHR} #define EXIT {if(init.filename) fclose(init.input); \ return(current);}

state engine(state init) { state current; // Local copy (we may want to back-up) state temp; // Workspace for recursive calls WORD rule; char error[100]; char currch;

init.depth++;

if(init.filename) // Open new file... { if(!(init.input = fopen(init.filename, "rb"))) { sprintf(error,"Fatal: cannot open file %s\n", init.filename); fatal(error); }

init.fileptr = 0; // Start at BOF init.linum = 1; getchr(NULL); // Force a re-read }

current = init; current.filename = NULL; // File dealt with GETCHR;

do { rule = SP; currstate = &current; // Expose it for logging

#if LOGSTATE { static int ctr = 0; char zz[8]; if((currch >= 0x20) && (currch = 0x20) && (lastch = (char)current.syntab[current.synptr+1]) && (currch 200)));

} while(1); // End the DO loop }

Reply to
David R Brooks

... snip ...

You are welcome to use hashlib, which is under GPL licensing, and was designed to fill just this sort of need. It is written in pure ISO C.

You could open one table and stuff it with the opcodes. Another could hold the macros, and yet another the symbols. The tables will all use the same re-entrant hash code, yet can have widely different content.

--
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson
More details at: 
Also see
Reply to
CBFalconer

If you know scripting languages like Perl or python, you can make something fairly quickly. I've made a simple Perl assembler script for my small FPGA-based CPU in about 200 lines of code.

For simple projects, a line by line translation works fine. Match each line against a set of patterns, translate mnemonic to opcode through a hash, and call subroutine to fill in operand bits. Use two passes. In the first pass, you build the symbol table (a hash), and in the second pass, you can generate the code. Or store all code in array, and dump it after second pass.

If you need more advanced features like macros, things become a bit more complicated, and then a properly written assembler, using lex/yacc maybe a better choice.

Reply to
Artenz

Well, that what i thought initially. I know Tcl and used it for parsing hdl code, so this seems to be an easy and fast solution for the given problem.

I just want to clarify few things - do I really need two passes (omit preprocessor for) in case I don't have conditional branches and jumps.

At the same time I was thinking about implementing such a feature as merging few sequential instructions provided they can be accomplished simultaneously. Obviously it is possible to declare just a new instructions, but this will not look that tidy.

Anyway, thank you for help. Alex

--
Alex
Reply to
Alex

The only reason I have two passes is to resolve forward branches/jumps. If you don't have those, you can use one pass. In my case, implementing two passes is nothing more than putting a 'foreach $pass (1..2)' loop around the main loop. I first read the entire source file in a array of lines, and dump generated code in an array of instructions. The first and second pass are identical, but during the first pass, unknown symbols will generate bad code. This gets fixed automatically on the second pass. After the 2nd pass, the array of generated code is written to the output file.

Reply to
Artenz

Thank you for your comment, you gave me a lot of food for thought. I think that in any case the main problem is to write an efficient hash function.

To make my life easier I was >

--------------------------------------------------------------------------------------------------------------

--
Alex
Reply to
Alex

OK, thanks.

By way how the translati> Alex wrote:

--
Alex
Reply to
Alex

Thanks, will try to find it. Dragon book would be really usefull in case I would like to create serious C compiler, otherwise it is too theoretical as you've said.

--
Alex
Reply to
Alex

What I did was make small subroutines for each of the different types of instructions. So, I have one for branches, one for arithmetical, etc.. Usually there are a few instructions in each type, and the hash translates the instruction into a base opcode value. The subroutine then takes that base value, and adds specific bits for the operands and such.

For instance, a pattern match in the main loop finds two words separated by whitespace, where the first word is in the %cond hash. If found, it calls the branch subroutine:

branch( $1, $2 ), next if m/^(\w+)\s+(\w+)$/ and $cond{$1};

The %cond hash looks like this:

my %cond = ( "beq" => 0x100, "bne" => 0x108, ... );

And the branch subroutine builds the opcode:

sub branch { my ($cond, $dest) = @_; my $addr = $labels{$dest}; my $base = $cond{$cond};

if( defined($addr) ) { my $rel = $addr - $loc;

if( $rel >= -6 && $rel

Reply to
Artenz

It certainly isn't the main problem. In fact it isn't a problem at all, such things are readily available.

--------------------------------------------------------------------------------------------------------------

Reply to
toby

at.

What do you already have experience with? Have you written a tokenizer before? Have you written recursive descent parsers (the easiest general form) before? Are you familiar with the syntax and semantics of 'productions'? Can you convert a left-recursive form to a right-recursive?

The basics of writing an assembler by hand (not through the use of, say, lexers and compiler-compiler tools) are first somehow first proceeding beyond just simple character scanning. A tokenizer is a very simple step to take and is almost natural in concept. For example, a tokenizer for this paragraph might simply break out words, periods, parentheses, and other special characters. With the central idea being the word which you might then use to parse sentences at a still higher level. But it also includes the idea of parsing larger concepts, such as a pseudo-op or a complete instruction.

A useful assembler may need to be able to handle forward label references, too. That may suggest to you a second pass through the source. If you support symbols, you'll need a table to hold those you've seen. That can involve hashing functions for fast lookup, or just simple linear scanning. Etc.

My first assembler was written in BASIC, supported symbols (a symbolic assembler), did not support macros, and took me from a Friday night through to the following Sunday afternoon to finish and get working. On Monday, I used it to assembly my first actual assembly program and run it on an HP 2116 CPU. To load the program and run it, I wrote more BASIC code using a BASIC interpreter to read the data file and load a memory array with it and then to jump into it.

At the time, I knew only a little about parsing and tokenizing but my instincts worked well. So it's not hard. Best thing is to just get started by looking at a few lines of example source code and thinking about what you would need to do by hand if you were to mentally assemble them. Work it out on paper.

Probably, what had helped me was something that was called CARDIAC. It was a paper computer (cheap) where you could write data and instructions on little erasable areas in numbered cells and there was a little paper ladybug you moved around by hand. Having had to act like a computer running code probably had put me in the right frame of mind earlier in my life. CARDIAC was late 1972 for me. My first assembler was in 1975.

I have a lot of possible references for you to read, but I'd need to know more about you to recommend any particular one. None of them are perfect for everyone and some can be very bad for some, even though they are excellent for others.

Jon

Reply to
Jonathan Kirwan

I have never understood the obsession with hash tables.

SO WHAT they provide O(1) search/insert times ... but they are FIXED in size ... and you often have to implement collision detection to get around soem problems.

A red-and-black tree is a better compromise ... a data structure that "grows" with the program run, and offers reasonable search and insert times (Olog(n)?)

There is no way to know how many labels/symbols exist in your asm source file ... and JUST choosing a hash table size and then reallocating twice that size whenever it fills up ... isnt a sound method of implementing a symbol table in my honest opinion.

For an opcode table (which is fixed in size as in you know the number of opcodes in the micro) a hash table makes more sense.

CBFalc> Isaac Bosompem wrote:

Reply to
samIam

Most 8bit micros have anywhere from 40 - 60 instructions (opcodes NOT opcodes+addressing formats)

O(n) searches on that set looks like a waste of cpu cyles

There are somethings you can ONLY do in assembly ... and on an 8 bit micro (with 64k address space) every byte counts

Reply to
samIam

If it's just "simple assembler" why not just use a list with O(n) access time.

Better yet, use a real programming language with a dictionary type (Python, Smalltalk, whatever).

--
Grant Edwards                   grante             Yow!  ... or were you
                                  at               driving the PONTIAC that
                               visi.com            HONKED at me in MIAMI last
                                                   Tuesday?
Reply to
Grant Edwards

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.