The second type of functions being supported are generic or multiple line functions. These functions are defined with a FUNCTION statement and whose identifier can be any standard identifier name with an optional data type symbol. The general syntax for a function is:
FUNCTION function_name(arguments)
...
function_name = <return value>
...
END FUNCTION
Thursday, December 31, 2009
Wednesday, December 30, 2009
Language Definition – User Defined Functions
The first type of functions being supported are the classic BASIC user defined or single statement functions. These functions are defined with a DEF statement and whose identifier name starts with FN. They may contain multiple arguments or no arguments. The entire function must be defined on one statement (line), though multiple user functions can be defined on the same line separated by colons. Two examples are:
DEF FNHypot(X,Y)=SQR(X*X+Y*Y)
DEF FNLength=LEN(First$+Last$)
Using FN for these functions means that there can be no other variables, arrays, functions or subroutines that start with FN. These user functions may also have data type (the default type is double precision). The user function arguments are considered local variables to the function and are not related to variables of the same name outside the function. For the example above, X and Y are local to FNHypot(), but X and Y outside of the function will not be affected when FNHypot() is called. The variables in FNLength are not arguments and are therefore not local. Any variables used in a function not listed as arguments are regular variables. These functions may also call other functions, but there needs to be check to make sure the function does not call itself.
DEF FNHypot(X,Y)=SQR(X*X+Y*Y)
DEF FNLength=LEN(First$+Last$)
Using FN for these functions means that there can be no other variables, arrays, functions or subroutines that start with FN. These user functions may also have data type (the default type is double precision). The user function arguments are considered local variables to the function and are not related to variables of the same name outside the function. For the example above, X and Y are local to FNHypot(), but X and Y outside of the function will not be affected when FNHypot() is called. The variables in FNLength are not arguments and are therefore not local. Any variables used in a function not listed as arguments are regular variables. These functions may also call other functions, but there needs to be check to make sure the function does not call itself.
Language Definition – Identifiers
Before getting into how the Parser is going to identify token types, there needs to some definition of what the identifiers will look like. There will be no limit placed on the size of identifiers, however, there is a practical upper limit because of the program line length. While I would like to limit line lengths to say 80 characters, this is probably not realistic. Some type of line wrapping will be necessary, but that's a problem for another day.
Identifiers must start with a letter, but may contain any number of letters and numbers plus the under-bar character. Identifiers will be case insensitive, however, identifiers will be saved as first entered. In other words, if a variable name like SomeVariableName is entered, that's how it will be saved, but any form like somevariablename, SOMEVARIABLENAME, SOMEvariableName, etc. will refer to the same variable, but the name will be displayed as it was first entered. (There will be allowance to rename variables later.)
Identifiers must be unique between variables, arrays, functions, subroutines and must not be any of the reserved BASIC commands and operators (e.g. PRINT, IF, etc. or even say Print, however the reserved BASIC command can be used within an identifier, for example, Print5 is acceptable). At the end of the identifier can be an optional symbol for the data type: “%” for integer, “$” for string, and “#” for double precision (the default). Later perhaps single precision can be supported with a “!” character. The data type symbol is considered part of the name, therefore the variable names Variable, Variable% and Variable$ all refer to different variables and may all be contained in a program.
Arrays, functions and subroutine identifier names contain an opening parenthesis at the end with no intervening white space. Note that while the opening parenthesis is considered part of the identifier, it is not stored. Therefore, having both Variable and Variable() in the same program is not allowed. This will allow array names to be used without the parentheses like in passing an entire array to a function or subroutine and MAT statements if implemented. Subroutine identifier names do not have a data type symbol as they don't have a return value.
Identifiers must start with a letter, but may contain any number of letters and numbers plus the under-bar character. Identifiers will be case insensitive, however, identifiers will be saved as first entered. In other words, if a variable name like SomeVariableName is entered, that's how it will be saved, but any form like somevariablename, SOMEVARIABLENAME, SOMEvariableName, etc. will refer to the same variable, but the name will be displayed as it was first entered. (There will be allowance to rename variables later.)
Identifiers must be unique between variables, arrays, functions, subroutines and must not be any of the reserved BASIC commands and operators (e.g. PRINT, IF, etc. or even say Print, however the reserved BASIC command can be used within an identifier, for example, Print5 is acceptable). At the end of the identifier can be an optional symbol for the data type: “%” for integer, “$” for string, and “#” for double precision (the default). Later perhaps single precision can be supported with a “!” character. The data type symbol is considered part of the name, therefore the variable names Variable, Variable% and Variable$ all refer to different variables and may all be contained in a program.
Arrays, functions and subroutine identifier names contain an opening parenthesis at the end with no intervening white space. Note that while the opening parenthesis is considered part of the identifier, it is not stored. Therefore, having both Variable and Variable() in the same program is not allowed. This will allow array names to be used without the parentheses like in passing an entire array to a function or subroutine and MAT statements if implemented. Subroutine identifier names do not have a data type symbol as they don't have a return value.
Parser
The parser needs to take an input line and separate out tokens so that the translator can begin the process of converting the line into the internal RPN format. The tokens will be one of several different types:
- Command Name
- Internal Function Name
- Remark (Comment) (New)
- Operator
- Variable Name
- Array Name
- User Function Name
- Constant
The first four on this list are part of the BASIC language and will be in listed in the Operator Table. The Operator Table will contain several pieces of information like the priority of the operator, the internal code of the operator, the string representation of the operator (used by the Parser and the Recreator), the function to call when running the program, etc. The items in the Operator Table will be expanded as the Interactive BASIC Compiler is developed. For now the Operator Table will contain the strings and the type (command, internal function or operator). For functions, there will also be a data type (integer, double, string or print). (New) Comments will require special handling by the parser.
The last four on this list each will also have a data type associated with them. If a token is not found in the Operator Table, then it is a member of one of the last four, or it is an invalid token (for example, if a symbol is found that is not an operator).
The Parser will return one token with a type and data type at a time. Along with each token will be the column that the token starts. This column will be used for error reporting. There's no point in converting the entire line into tokens before the tokens are processed.
Updated Wednesday, January 6, 2010; 10.55 pm: Added information about remarks.
The last four on this list each will also have a data type associated with them. If a token is not found in the Operator Table, then it is a member of one of the last four, or it is an invalid token (for example, if a symbol is found that is not an operator).
The Parser will return one token with a type and data type at a time. Along with each token will be the column that the token starts. This column will be used for error reporting. There's no point in converting the entire line into tokens before the tokens are processed.
Updated Wednesday, January 6, 2010; 10.55 pm: Added information about remarks.
Monday, December 28, 2009
List Class – Testing and Release
I wrote three simple functions to test the list class template (plus three functions to print the contents of the stack). The first first test function one tests an integer stack. The second test function defines a simple structure consisting of an integer and a character pointer (string). I think the parser (the next item to implement) will need a similar structure to hold pointer to the token string from the input line along with the column the token begins (needed to report the location of syntax errors). The third test function defines a simple enumeration to tests an enumeration stack.
It was during this testing I realized two things. First that I wanted to pass structure values to push onto the stack or pop from the stack by pointers and not by values (passing the entire structure is not efficient). A second was that I still wanted to be able to push constants for lists of simple types. I know this functionality will be needed for the operator stack in the translator. Therefore, this required two sets of push/pop functions. The same function names could be used because of C++'s overloading function name feature. I made the pop by pointer function return the status of whether the stack is empty before a value is popped, in other words, if false is returned, the stack is empty and no value is returned.
I have released this code so far. The ibcp_0.0.1-src.zip file has been uploaded at Sourceforge IBCP Project. The .zip file contains the list.h file with the list template class, the test_stack.cpp source file for testing lists (stacks), the VIDE2 project file test_stack.vpj and Makefile.v, both generated by VIDE2. The project was compiled using Borland C++ 5.5. (I have yet to get GCC working under MinGW for use with VIDE2 – something I will keep trying as I want to eventually compare the two compilers.)
It was during this testing I realized two things. First that I wanted to pass structure values to push onto the stack or pop from the stack by pointers and not by values (passing the entire structure is not efficient). A second was that I still wanted to be able to push constants for lists of simple types. I know this functionality will be needed for the operator stack in the translator. Therefore, this required two sets of push/pop functions. The same function names could be used because of C++'s overloading function name feature. I made the pop by pointer function return the status of whether the stack is empty before a value is popped, in other words, if false is returned, the stack is empty and no value is returned.
I have released this code so far. The ibcp_0.0.1-src.zip file has been uploaded at Sourceforge IBCP Project. The .zip file contains the list.h file with the list template class, the test_stack.cpp source file for testing lists (stacks), the VIDE2 project file test_stack.vpj and Makefile.v, both generated by VIDE2. The project was compiled using Borland C++ 5.5. (I have yet to get GCC working under MinGW for use with VIDE2 – something I will keep trying as I want to eventually compare the two compilers.)
List Class – Implementation
A class template does not actually generate any code, so all the functions will be written directly in the header (.h) file (which turns out to be necessary to properly instantiate an actual list in the code). To instantiate a list, the following is used:
List<int> int_stack;
List<some_struct> some_struct_stack;
To create a pointer to element within the list, the following is used:
List<int>::Element *int_element;
List<some_struct>::Element *some_element;
Note that the struct Element is defined as public to allow this (though the master element pointer is private). The functions implemented initially are:
List() – constructor; allocates master element and initializes it
~List() – destructor; deallocates any elements in list and master element
bool empty() – checks if list is currently empty
Element *top() – get a pointer to element on top of stack.
push(T value) – pushes value (by value) on stack
T pop() - pops value from top of stack (must not be empty)
push(T *value) – pushes value (by pointer) on stack
bool pop(T *value) – pops value (by pointer) from top of stack (if not empty)
first(Element *&element) – sets pointer to first element
next(Element *&element) – sets pointer to next element
bool not_end(Element *&element) – check if not end
There are two sets of push and pop functions. One set works with actual values and is more appropriate for lists of simple types (int, doubles, enums, etc.). The other set works with pointers to the actual values, which is better for lists of structures, but can be used for simple types also. I implemented the by value push and pop so that constants could be pushed directly without having to define a variable and set it to the value to push. I foresee this may be necessary for some of the stacks. Note that the pop value by value function does not return any status of the list being empty, and therefore the list must not be empty before using this function. I also added the three functions for use with for statements. Any additional functions will be added to the list class template as needed.
List<int> int_stack;
List<some_struct> some_struct_stack;
To create a pointer to element within the list, the following is used:
List<int>::Element *int_element;
List<some_struct>::Element *some_element;
Note that the struct Element is defined as public to allow this (though the master element pointer is private). The functions implemented initially are:
List() – constructor; allocates master element and initializes it
~List() – destructor; deallocates any elements in list and master element
bool empty() – checks if list is currently empty
Element *top() – get a pointer to element on top of stack.
push(T value) – pushes value (by value) on stack
T pop() - pops value from top of stack (must not be empty)
push(T *value) – pushes value (by pointer) on stack
bool pop(T *value) – pops value (by pointer) from top of stack (if not empty)
first(Element *&element) – sets pointer to first element
next(Element *&element) – sets pointer to next element
bool not_end(Element *&element) – check if not end
There are two sets of push and pop functions. One set works with actual values and is more appropriate for lists of simple types (int, doubles, enums, etc.). The other set works with pointers to the actual values, which is better for lists of structures, but can be used for simple types also. I implemented the by value push and pop so that constants could be pushed directly without having to define a variable and set it to the value to push. I foresee this may be necessary for some of the stacks. Note that the pop value by value function does not return any status of the list being empty, and therefore the list must not be empty before using this function. I also added the three functions for use with for statements. Any additional functions will be added to the list class template as needed.
Sunday, December 27, 2009
List Class – Design
There will be several different stacks needed. Each type of stack would have the same functionality but would hold a different type of items. The best way to implement this is by using a class template. The list class template will have and element contained the next and previous link pointers plus the generic value. The beginning of the list class will be defined as:
template <class T> List {...}
Where T is the generic type that the list will hold. This generic type could be a simple like an int or double, or could be a structure. The internal element structure containing the links and value will be defined as:
struct Element {
Element *prev;
Element *next;
T value;
};
The list will have a master element pointer that will be a pointer to Element but that actually element allocated will be without the T value; in other words, allocated for sizeof(Element) – sizeof(T). The master element pointer will be allocated and initialized in the constructor function.
The master->next variable will point to the first element in the list (i.e. bottom of a stack) and the master->prev variable will pointer to the last element in the list (i.e. the top of a stack). Implementation of the various functions is simplified if the list is circular, in other words, the last element's next pointer points back to the master element (and the first element's previous pointer also points back to the master element). In an empty list, both the master element's next and previous pointers would point to the master element. Therefore, the master element's pointers will be initialized this way after it is allocated.
template <class T> List {...}
Where T is the generic type that the list will hold. This generic type could be a simple like an int or double, or could be a structure. The internal element structure containing the links and value will be defined as:
struct Element {
Element *prev;
Element *next;
T value;
};
The list will have a master element pointer that will be a pointer to Element but that actually element allocated will be without the T value; in other words, allocated for sizeof(Element) – sizeof(T). The master element pointer will be allocated and initialized in the constructor function.
The master->next variable will point to the first element in the list (i.e. bottom of a stack) and the master->prev variable will pointer to the last element in the list (i.e. the top of a stack). Implementation of the various functions is simplified if the list is circular, in other words, the last element's next pointer points back to the master element (and the first element's previous pointer also points back to the master element). In an empty list, both the master element's next and previous pointers would point to the master element. Therefore, the master element's pointers will be initialized this way after it is allocated.
Saturday, December 26, 2009
Popping Items From Middle of Stack
I initially thought there was a need to pop items from the middle of stack, specifically for functions and subroutine calls. Essentially the thinking was that the function or subroutine call would push the arguments on the stack, then upon executing the function/subroutine, it would save the current top of the stack, push it's argument references on the stack (because that is what would be written in the internal language), and then it would pop from the two places as it filled the local variables that represented the arguments. I won't go into any more details because this is probably not the best way to implement function/subroutine calls.
The way C/C++ works internally is that the arguments are pushed onto the stack, and the function (there are technically no subroutines in C/C++, all routines are functions, though some functions can be defined with void return value, i.e. no return value), uses the values on the stacks as the local variables for the arguments within the function (other local variable are also on the stack below the arguments). C/C++ only passes by value, not by reference (though by reference is emulated using pointers and C++ actually does has references, but it is still done by pointers internally).
Now I think emulating this function/subroutine call mechanism where the local variables representing the arguments are the values on the stack is a better way. This mechanism does not need to pop items from the middle of the stack, so this functionality is not necessary. Though technically, this is the same as removing items from the middle of a list, and I believe will be needed, though not for stacks, but for more generic lists.
The way C/C++ works internally is that the arguments are pushed onto the stack, and the function (there are technically no subroutines in C/C++, all routines are functions, though some functions can be defined with void return value, i.e. no return value), uses the values on the stacks as the local variables for the arguments within the function (other local variable are also on the stack below the arguments). C/C++ only passes by value, not by reference (though by reference is emulated using pointers and C++ actually does has references, but it is still done by pointers internally).
Now I think emulating this function/subroutine call mechanism where the local variables representing the arguments are the values on the stack is a better way. This mechanism does not need to pop items from the middle of the stack, so this functionality is not necessary. Though technically, this is the same as removing items from the middle of a list, and I believe will be needed, though not for stacks, but for more generic lists.
Friday, December 25, 2009
Reading Stacks From Bottom to Top
Take the example of a PRINT statement as follows:
PRINT A;TAB(20);B,C
I'm anticipating that this will be encoded into the internal language symbolically as follows:
A 20 TAB B COMMA C PRINTNL
Note that semicolons are not encoded, but assumed between expressions. There will also be two types of PRINT statements, one with an implied new-line (PRINTNL) and one that ends with a semicolons or commas to stay on the current line (PRINT).
When this command is executed, the expression stack will look as follows after all the parts are executed up to the PRINTNL command:
A <-- bottom
TAB(20)
B
COMMA
C <-- top
When the PRINTNL is read, the PRINTNL command code will be called. To print the items in order (i.e. A first), the stack will need to be read from from the bottom to the top. The expressions and functions could be written into the internal language backwards so they will be on the stack in the correct order during run time (to pull each from the top), but this will complicate the translator and the recreator, plus the expressions would be evaluated backwards, an idea I'm not comfortable with. Therefore, it's easiest to keep them in the same order as entered, hence the need to read the stack from the bottom to the top.
PRINT A;TAB(20);B,C
I'm anticipating that this will be encoded into the internal language symbolically as follows:
A 20 TAB B COMMA C PRINTNL
Note that semicolons are not encoded, but assumed between expressions. There will also be two types of PRINT statements, one with an implied new-line (PRINTNL) and one that ends with a semicolons or commas to stay on the current line (PRINT).
When this command is executed, the expression stack will look as follows after all the parts are executed up to the PRINTNL command:
A <-- bottom
TAB(20)
B
COMMA
C <-- top
When the PRINTNL is read, the PRINTNL command code will be called. To print the items in order (i.e. A first), the stack will need to be read from from the bottom to the top. The expressions and functions could be written into the internal language backwards so they will be on the stack in the correct order during run time (to pull each from the top), but this will complicate the translator and the recreator, plus the expressions would be evaluated backwards, an idea I'm not comfortable with. Therefore, it's easiest to keep them in the same order as entered, hence the need to read the stack from the bottom to the top.
Wednesday, December 23, 2009
Stacks, Queue and Lists
A stack is a Last-In-First-Out (LIFO) list, a queue is a First-In-First-Out (FIFO) list, and a list is an Any-In-Any-Out list meaning elements can added to the beginning, middle or end of a list and elements can be removed from the beginning, middle or end of the list. Stacks inside the computer are essentially implemented as an array with an index pointing to the top of the stack, growing as large as needed given available memory. To implement this in a higher level language like C++, this type of implementation would need a fixed size of memory allocated. Changing the size during run time, while not impossible, is not practical or efficiently.
A better way to implement a stack in C++ is as a linked list, because memory does not need to be allocated ahead of time for the stack whose size is not yet known. To add a new element to the top of the stack, a single element is allocated from the free memory heap, then linked into the current list. To remove the element from the top of the stack, it is simply de-linked from the stack and the item is deallocated back to the free memory heap. This method requires an overhead of a link with each item (that points to the next item on the stack) and the stack would contain a pointer to the top element (or a NULL pointer if the stack is empty).
However, as will be explained, there will be occasions where it will be necessary to read a stack from the bottom to the top. And there will be occasions where it will be necessary to pop an item from the middle and remove it.
Using the single link method, a stack cannot be read from the bottom to the top or read and removed items from the middle. In order to support this, a double link method will be needed, which is required for an AIAO list. One link points to the next item in the list, the other link points to the previous item in the list. A list class will be implemented to support lists, stacks and queues.
A better way to implement a stack in C++ is as a linked list, because memory does not need to be allocated ahead of time for the stack whose size is not yet known. To add a new element to the top of the stack, a single element is allocated from the free memory heap, then linked into the current list. To remove the element from the top of the stack, it is simply de-linked from the stack and the item is deallocated back to the free memory heap. This method requires an overhead of a link with each item (that points to the next item on the stack) and the stack would contain a pointer to the top element (or a NULL pointer if the stack is empty).
However, as will be explained, there will be occasions where it will be necessary to read a stack from the bottom to the top. And there will be occasions where it will be necessary to pop an item from the middle and remove it.
Using the single link method, a stack cannot be read from the bottom to the top or read and removed items from the middle. In order to support this, a double link method will be needed, which is required for an AIAO list. One link points to the next item in the list, the other link points to the previous item in the list. A list class will be implemented to support lists, stacks and queues.
Saturday, December 19, 2009
Time To Get Started
I'm anxious to get started coding. I have a bunch of notes scratched down on the various parts of the interactive compiler. I wanted to have a long discussion on GOTOs (essentially what they are used for and how structured elements of a programming language can eliminate their need). But I'll go into this as I discuss the specific implementation of the BASIC commands as needed. Before coding begins, some organization and planning are needed for the basic components. Right now I see the basic components as follows:
The translator, expression, encoder, re-creator and run time modules are require stacks, lists and/or queues in some form as will been seen as the design of each is developed. Stacks, queues and lists are very similar except how elements are added and removed. This is where I will start – developing code to implement stacks and lists.
- LANGUAGE – definition of the language elements.
- INTERNAL LANGUAGE – the way the program is stored in memory.
- PARSER – parses the lines into tokens.
- TRANSLATOR – translates the tokens into RPN elements.
- EXPRESSIONS – part of the translator that processes expressions.
- ENCODER – encodes the RPN elements into the internal language.
- RECREATOR – recreates the origin source from the internal language.
- RUNTIME – runs the program.
The translator, expression, encoder, re-creator and run time modules are require stacks, lists and/or queues in some form as will been seen as the design of each is developed. Stacks, queues and lists are very similar except how elements are added and removed. This is where I will start – developing code to implement stacks and lists.
Sunday, December 13, 2009
More About Blocking
An example of the blocking problem, say with an IF/ENDIF block, if the IF/THEN line is entered first, this constitutes a context error until its corresponding ENDIF line is entered and once entered the block is complete and the context error count is reset allowing the program to run. Now say another IF/THEN line is entered between the first IF/THEN/ENDIF block. The new IF/THEN would be connected to the ENDIF line and the first IF/THEN is now in error. This block connection and disconnection can get rather involved and will require careful programming.
Once the screen editor component is implemented, it would be nice to highlight these context errors in the editor. For example if an IF/THEN or ENDIF line does not have a corresponding ENDIF or IF/THEN line, then the “IF” would display in reverse video or with a red background to let the user know there is a problem. The editor will also be able to automatically indent the contents within blocks, another visual cue to assist the programmer.
This error highlighting would apply to other types of context errors, for instances with arrays. Lines may be entered using an array before the DIM line for the array is entered. The number of subscripts that an array has can be inferred by the first time it is entered and if any subsequent references of the array have a different number of subscripts, a context error is flagged and all instances of the array are highlighted. The size of the array and the actual number of subscripts is not known until the actual DIM line is entered. There would be different highlighting for mismatch array subscripts and arrays that don't yet have a DIM (or perhaps have more than one DIM or a DIM after the array is referenced). When a DIM line is entered (before the first reference of the array), the highlighting is removed from the instances of the array (assuming of course that there are no further context errors, like instanced having the wrong number of subscripts).
Once the screen editor component is implemented, it would be nice to highlight these context errors in the editor. For example if an IF/THEN or ENDIF line does not have a corresponding ENDIF or IF/THEN line, then the “IF” would display in reverse video or with a red background to let the user know there is a problem. The editor will also be able to automatically indent the contents within blocks, another visual cue to assist the programmer.
This error highlighting would apply to other types of context errors, for instances with arrays. Lines may be entered using an array before the DIM line for the array is entered. The number of subscripts that an array has can be inferred by the first time it is entered and if any subsequent references of the array have a different number of subscripts, a context error is flagged and all instances of the array are highlighted. The size of the array and the actual number of subscripts is not known until the actual DIM line is entered. There would be different highlighting for mismatch array subscripts and arrays that don't yet have a DIM (or perhaps have more than one DIM or a DIM after the array is referenced). When a DIM line is entered (before the first reference of the array), the highlighting is removed from the instances of the array (assuming of course that there are no further context errors, like instanced having the wrong number of subscripts).
Blocking Commands
Where it's going to get complicated for my interactive compiler is with the blocking commands. Examples of blocking commands include IF/THEN/ELSE/ENDIF, FOR/NEXT, DO/LOOP and SELECT/CASE. These commands are entered on multiple lines and can be entered in any order. A regular compiler matches the blocks during compilation and any unmatched blocks are reported as “context” errors. Generally an interpreter does not catch these context errors until run time since the program lines can be entered in any order. Ideally, an interactive compiler should catch as many errors as possible as soon as possible and before run time, including these context errors. For run time speed, the less checking done at run time, the faster it will run the program.
Brown suggests ignoring these context errors when the lines are entered and having a “pre-run” module that goes through the program checking for context errors before the program is run. Brown briefly discusses an alternative method where tables are kept as these context errors are created in a table and removing entries as the context errors are resolved, but decides this method is too complicated and opts for the simpler approach of having a pre-run module.
I disagree with the need to have a pre-run module, which would delay the starting of the program when the RUN command is entered. This takes away one of the advantages of an interactive compiler and would make it appear to be simply compiling the program before running. I plan on taking Brown's table idea further. It's difficult to explain, but essentially as lines are entered, a table with information on the blocks in the code is built as the lines are entered and the number of current context errors is kept track of. These context errors are different from syntax errors that can be caught immediately when a line is entered. With a block table and context error outstanding count, when a RUN command is issued, this count simply needs to be checked and if not zero, the errors are reported and the program is not run.
Brown suggests ignoring these context errors when the lines are entered and having a “pre-run” module that goes through the program checking for context errors before the program is run. Brown briefly discusses an alternative method where tables are kept as these context errors are created in a table and removing entries as the context errors are resolved, but decides this method is too complicated and opts for the simpler approach of having a pre-run module.
I disagree with the need to have a pre-run module, which would delay the starting of the program when the RUN command is entered. This takes away one of the advantages of an interactive compiler and would make it appear to be simply compiling the program before running. I plan on taking Brown's table idea further. It's difficult to explain, but essentially as lines are entered, a table with information on the blocks in the code is built as the lines are entered and the number of current context errors is kept track of. These context errors are different from syntax errors that can be caught immediately when a line is entered. With a block table and context error outstanding count, when a RUN command is issued, this count simply needs to be checked and if not zero, the errors are reported and the program is not run.
Saturday, December 12, 2009
Immediate Commands
With the goal of keeping the first stage simple, the program will use console mode with classic BASIC line numbers. This will require immediate commands – commands executed immediately when entered (but not allowed in a program). Initially the interactive compiler will have a prompt and command format like BASIC interpreters have. If the command entered starts with a line number, then it will be compiled and inserted into the program memory, otherwise the command will be executed. The regular BASIC commands will be allowed on the command line, but I'm not sure this moment about the looping commands. The immediate commands needed for now include the following:
- SAVE
- LOAD
- LIST
- EDIT
- RENUM (to renumber the program lines)
- DELETE (lines can also be deleted by entering a line number be itself)
- RUN
- NEW
Friday, December 11, 2009
Initial BASIC Language Subset
Below is a list of BASIC commands that will be implemented first. These commands will allow simple programs to be entered and run. It will allow the expression parser and the complicated blocking code to be implemented and tested. These are listed in the approximate order to be implemented.
- LET (assignment)
- INPUT
- DIM (for arrays)
- REM
- IF/THEN/ELSE/ENDIF
- FOR/NEXT
- DO/LOOP (with WHILE/UNTIL)
- Double precision math operators (+,-,*,/, \, MOD, and ^)
- Variables (with any size name)
- Constants
- Numerical functions (ABS, FIX, INT, RND, SGN, and SQR)
- Scientific functions (ATN, COS, EXP, LOG, SIN, and TAN)
- Relational operators (=,<>,<,<=,>, >=, AND, EQV, IMP, NOT, OR, and XOR)
- Print functions (comma, TAB, and SPC)
- Multiple dimension arrays
- Multiple statements per line
The important thing is to get something working and not to over complicate the initial stage. Important and necessary features like strings and integers with associated operators and variables, SELECT CASE, subroutines and user functions etc. will come in later stages. Notice the lack of a GOTO command - this was intentional. Initially classic BASIC line numbers will be used (despite the lack of a GOTO) as a way of entering the lines into memory via console mode. Eventually this will be replaced with a screen editor, but a screen editor is a major undertaking in itself, so this will be delayed for a later stage.
Wednesday, December 9, 2009
Internal Language
Brown suggests that the internal language be in the form of RPN (Reverse Polish Notation) and I agree. As it will allow the program to be easily and quickly run. Basically the following commands:
A = B + C * 4
PRINT A+B;TAB(20);C+5
Would be translated to RPN, which would symbolically be represented as something like:
A B C 4 * + =(assignment)
A B + 20 TAB C 5 + PRINT
When the program is run, each operand is pushed onto a stack. When an operator is reached, it's operands (e.g. two for *, one for TAB, etc.) are pulled from the stack, the operation is performed and the result is pushed back onto the stack. When a command is reached (e.g. PRINT), it preforms it's procedure on the values that are currently on the stack. It's more involved than this, but this is the general idea. The actual internal language is encoded from the symbolic RPN representation.
A = B + C * 4
PRINT A+B;TAB(20);C+5
Would be translated to RPN, which would symbolically be represented as something like:
A B C 4 * + =(assignment)
A B + 20 TAB C 5 + PRINT
When the program is run, each operand is pushed onto a stack. When an operator is reached, it's operands (e.g. two for *, one for TAB, etc.) are pulled from the stack, the operation is performed and the result is pushed back onto the stack. When a command is reached (e.g. PRINT), it preforms it's procedure on the values that are currently on the stack. It's more involved than this, but this is the general idea. The actual internal language is encoded from the symbolic RPN representation.
Sunday, December 6, 2009
Development Platform
Development will be done Windows XP Home as that is what my computer runs (for the moment I do not intend to upgrade, to Vista or 7, as XP Home is working just fine). For now I'm not planning to develop for cross platforms (like for example the way FreeBasic runs on several platforms). Besides, whatever criticisms there are about Windows, it does have the largest market share.
I thought about trying out some of the algorithms in Tcl (using the free ActiveTcl package) and even sketched some of out in Tcl, but Tcl is a relatively unknown language, is basically in interpreter itself (though it may somewhat compile code internally – I not familiar with the details) and is not really fast. I also considered using something like FreeBasic, but that would be something new to learn (and most of my BASIC experience is with GW-Basic). I have the most experience with C and C++, though I'm currently rusty on C++ as I haven't been using the last 10 years. I decided on C++ because I think it is a good language and this is a chance to get back into it (and it may be come is use in the near future in my regular job).
There are many C++ compilers I could use, but I decided on the free Borland C++ 5.5 command line tools using the free Vide2. The second choice would be the GNU GCC package with G++ if the Borland C++ proves unsatisfactory, but so far I was not able to get it work on Windows using MinGW and MSYS (I do use GCC for my regular job on Linux). I learned C++ on Borland's C++ package for DOS with DPMI upgrading through 4.5 during the 1990s. And was able to successfully get Borland C++ 5.5 to work with Vide2. I'm not buying a commercial product for this effort since this is right now just a hobby.
I thought about trying out some of the algorithms in Tcl (using the free ActiveTcl package) and even sketched some of out in Tcl, but Tcl is a relatively unknown language, is basically in interpreter itself (though it may somewhat compile code internally – I not familiar with the details) and is not really fast. I also considered using something like FreeBasic, but that would be something new to learn (and most of my BASIC experience is with GW-Basic). I have the most experience with C and C++, though I'm currently rusty on C++ as I haven't been using the last 10 years. I decided on C++ because I think it is a good language and this is a chance to get back into it (and it may be come is use in the near future in my regular job).
There are many C++ compilers I could use, but I decided on the free Borland C++ 5.5 command line tools using the free Vide2. The second choice would be the GNU GCC package with G++ if the Borland C++ proves unsatisfactory, but so far I was not able to get it work on Windows using MinGW and MSYS (I do use GCC for my regular job on Linux). I learned C++ on Borland's C++ package for DOS with DPMI upgrading through 4.5 during the 1990s. And was able to successfully get Borland C++ 5.5 to work with Vide2. I'm not buying a commercial product for this effort since this is right now just a hobby.
Background Part III – Current Plans
Flash forward to a couple of months ago during a house remodeling project, I came across the old notes of my BASIC and a book that I was taking inspiration from at the time called Writing Interactive Compilers and Interpreters, 1980 by P. J. Brown. I started reading the book and was fascinated. I remembered my BASIC, the promise of the incremental compiler QuickBasic 4.5, and now had a renewed interest in trying to create an interactive incremental BASIC compiler as I had some new ideas – things that I personally have not seen in a development environment during all my years of software development.
My intention this time around is to develop an Interactive BASIC Compiler with an end goal of what I thought QuickBasic should have been. This is going to be a complicated project, so starting small and doing it in stages with several steps each will be required. The initial goal is to get something working and then building upon it. I will not be using anything previously developed from my BASIC interpreter. That was in 6809 assembler and while I do have source listing printouts (any e-copies are long lost), it's not really applicable to the current project. I will be looking at some of the notes I made about the BASIC language itself that I was thinking about at the time as it may be applicable. I will also be looking at existing BASIC compilers for inspiration (FreeBasic, xBasic, RealBasic, TrueBasic, PowerBasic, thinBasic, Smallbasic and of course GW-Basic). This project will not be written in assembly language - that just more effort that I want to take on. The ground rules for now are as follows:
Why even bother as there are so many good software development tools around? Because the whole idea fascinates me and programming is not only my career, but also a hobby. And maybe I can demonstrate some unique ideas along the way. And I'm really attempting to compete with those other BASICs – that is just not the current goal.
My intention this time around is to develop an Interactive BASIC Compiler with an end goal of what I thought QuickBasic should have been. This is going to be a complicated project, so starting small and doing it in stages with several steps each will be required. The initial goal is to get something working and then building upon it. I will not be using anything previously developed from my BASIC interpreter. That was in 6809 assembler and while I do have source listing printouts (any e-copies are long lost), it's not really applicable to the current project. I will be looking at some of the notes I made about the BASIC language itself that I was thinking about at the time as it may be applicable. I will also be looking at existing BASIC compilers for inspiration (FreeBasic, xBasic, RealBasic, TrueBasic, PowerBasic, thinBasic, Smallbasic and of course GW-Basic). This project will not be written in assembly language - that just more effort that I want to take on. The ground rules for now are as follows:
- Implement a small subset of BASIC commands to start.
- Implement a selection of immediate commands.
- Develop on Windows using C++.
- Use console mode for now (like a BASIC interpreter).
- Release the code at each stage/step under GPL on SourceForge.
Why even bother as there are so many good software development tools around? Because the whole idea fascinates me and programming is not only my career, but also a hobby. And maybe I can demonstrate some unique ideas along the way. And I'm really attempting to compete with those other BASICs – that is just not the current goal.
Background Part II – Plans Not Realized
While I was working on my BASIC Interpreter, a local computer store that dealt in SS-50 bus computers along with a handful of investors (not me though) designed their own SS-50 bus – a computer that was similar to other small computers of the time (had a built-in keyboard, just plug it in an attached a monitor or TV). These other computers (e.g. TRS-80, Apple, Commodore PET, OSI, etc.) came with BASIC in ROM (read-only memory) so BASIC was there when the computer was turned on. Most SS-50 bus type computers required a terminal to communicate with through a serial interface. One of the member designed a complete terminal on a board that when put into the computer appeared as a serial device. This board was actually a computer itself with a 6809 CPU, CRTC (CRT controller), EPROM and keyboard interface. I was the one the actually wrote the control program for this board. They wanted a BASIC in ROM for the computer. So I converted my 8K BASIC to run in 8K of EPROM and it was simply called ROM-BASIC.
The next issue that caused my 8K BASIC to be slower than Microsoft's was the math. The 8K BASIC used a packed BCD (binary coded decimal) 6-byte format for floating point numbers. The Microsoft BASIC used a 4-byte binary floating point (probably IEEE format or something very similar). Binary floating point with only 4 bytes is much fasted than 6 byte packed BCD. I had a book that explained binary floating point and contained code for a math package (only the standard 4 math functions plus the necessary support functions), but was written in 6800. I converted this to 6809 with all the optimizations possible. I still had to deal with all the scientific functions. I never did get this plugged into my 8K BASIC. I was going to college at the time and that took my time. I was keeping a notebook at the time and the dates end around 1981. There were few more notes from the end of 1984/ begin of 1985 (a break at college) and a few more dated October 1986. The time for a 6809 BASIC had passed and I never picked it back up.
During the 1980's I worked part/full time for a local company developing their business software. Initially this was developed on a TRS-80 Model II in BASIC (another Microsoft variety). Around 1984, the software was converted to BasicA/GW-Basic to run on PCs, using BASCOM3 to compile the code for production. Sometime in the mid-1980's, Microsoft released their QuickBasic 4.5 an incremental compiler. I though it was such a great idea and had good promise, but upon using it, I decided that they could have taken it much further. Also, it ran programs much more slowly than BASCOM3, so development continued with GW-Basic using BASCOM3 for compiling.
The next issue that caused my 8K BASIC to be slower than Microsoft's was the math. The 8K BASIC used a packed BCD (binary coded decimal) 6-byte format for floating point numbers. The Microsoft BASIC used a 4-byte binary floating point (probably IEEE format or something very similar). Binary floating point with only 4 bytes is much fasted than 6 byte packed BCD. I had a book that explained binary floating point and contained code for a math package (only the standard 4 math functions plus the necessary support functions), but was written in 6800. I converted this to 6809 with all the optimizations possible. I still had to deal with all the scientific functions. I never did get this plugged into my 8K BASIC. I was going to college at the time and that took my time. I was keeping a notebook at the time and the dates end around 1981. There were few more notes from the end of 1984/ begin of 1985 (a break at college) and a few more dated October 1986. The time for a 6809 BASIC had passed and I never picked it back up.
During the 1980's I worked part/full time for a local company developing their business software. Initially this was developed on a TRS-80 Model II in BASIC (another Microsoft variety). Around 1984, the software was converted to BasicA/GW-Basic to run on PCs, using BASCOM3 to compile the code for production. Sometime in the mid-1980's, Microsoft released their QuickBasic 4.5 an incremental compiler. I though it was such a great idea and had good promise, but upon using it, I decided that they could have taken it much further. Also, it ran programs much more slowly than BASCOM3, so development continued with GW-Basic using BASCOM3 for compiling.
Saturday, December 5, 2009
Background Part I – BASIC Interpreter
I worked on my own BASIC interpreter back around 1980-1981. The first computer I used was a SWTPC 6800 (SS-50 bus computer) with 4K of memory (1977). It did not have enough memory to run either the 4K or 8K BASIC that was available. Eventually it was upgraded with enough memory to run the 8K BASIC and eventually it upgraded to a 6809 (around 1979). The 6800 machine code could not be run on the 6809 directly. The 6809 was source level compatible with the 6800, which meant the a 6800 assembly source file could be assembled with a 6809 assembler with no changes to produce 6809 machine code.
This BASIC interpreter started out as a simple conversion from 6800 to 6809. I used a 6800 disassembler on the 8K BASIC. With some modification of the output, I was able to get the 8K BASIC running on the 6809. Around the same time we acquired Microsoft BASIC for this computer, I don't remember the exact timing or whether it was for the 6800 or 6809. I do remember that the Microsoft BASIC was much faster. So I dug deeper to find out why.
It turned out that the 8K BASIC interpreter was a pure interpreter. That is it put the lines entered directly into memory as is. I did discover that it did convert the first command on a line (PRINT, IF, etc.) into an address (pointer) to an internal table that contain the text of the command and the location of the subroutine for handling the command. This was probably done for executing direct commands. But it only converted the first command on the line, not any other commands on a line (multiple statements separated by colons). For these it reverted to parsing mode for execution. The Microsoft BASIC on the other hand converted all commands, words (TO, THEN, etc.) and operators (<>, <=, etc.) into one-byte tokens. So when it ran the program, the parsing of the tokens was already performed.
At this point, I did two things to my 8K BASIC, I parsed all tokens into one-byte codes and I rewrote most of the code to take advantage of the 6809. This significantly made it faster, but still not quite the speed of the Microsoft BASIC.
This BASIC interpreter started out as a simple conversion from 6800 to 6809. I used a 6800 disassembler on the 8K BASIC. With some modification of the output, I was able to get the 8K BASIC running on the 6809. Around the same time we acquired Microsoft BASIC for this computer, I don't remember the exact timing or whether it was for the 6800 or 6809. I do remember that the Microsoft BASIC was much faster. So I dug deeper to find out why.
It turned out that the 8K BASIC interpreter was a pure interpreter. That is it put the lines entered directly into memory as is. I did discover that it did convert the first command on a line (PRINT, IF, etc.) into an address (pointer) to an internal table that contain the text of the command and the location of the subroutine for handling the command. This was probably done for executing direct commands. But it only converted the first command on the line, not any other commands on a line (multiple statements separated by colons). For these it reverted to parsing mode for execution. The Microsoft BASIC on the other hand converted all commands, words (TO, THEN, etc.) and operators (<>, <=, etc.) into one-byte tokens. So when it ran the program, the parsing of the tokens was already performed.
At this point, I did two things to my 8K BASIC, I parsed all tokens into one-byte codes and I rewrote most of the code to take advantage of the 6809. This significantly made it faster, but still not quite the speed of the Microsoft BASIC.
What is an Interactive Compiler?
A Compiler is a program that converts a human readable programming language like BASIC into a language understood by a computer. Once converted, the program is no longer readable by a human and is not easy if not impossible to convert back to the original human readable source file. However, the compiled program runs fast.
An Interpreter is a program that directly executes a human readable programming language directly. Interpreters tends to run the program slowly since it is continuously reading and parsing the source file to execute the program. However, interpreters offer interactivity where no separate compile step is necessary and offers the ability to execute commands immediately (like for examining program variables when the program is temporarily stopped like for debugging).
An Incremental Compiler converts or compilers each line of a source file into an internal language understood by the computer as it is entered. It offers the advantages of a compiler (speed) and the advantages of an interpreter (interactivity).
An Interactive Compiler is technically any programming environment that is interactive, anything from a pure interpreter to an incremental compiler, but not a full a compiler. What I am referring to as an Interactive Compiler is one that its internal language is in the form that is easily converted back into the original source lines (though not necessary exactly as typed) and in a form that the computer can run efficiently without doing an time wasting interpretation (which is done during the creation of the internal code and not during run time as in an interpreter).
An Interpreter is a program that directly executes a human readable programming language directly. Interpreters tends to run the program slowly since it is continuously reading and parsing the source file to execute the program. However, interpreters offer interactivity where no separate compile step is necessary and offers the ability to execute commands immediately (like for examining program variables when the program is temporarily stopped like for debugging).
An Incremental Compiler converts or compilers each line of a source file into an internal language understood by the computer as it is entered. It offers the advantages of a compiler (speed) and the advantages of an interpreter (interactivity).
An Interactive Compiler is technically any programming environment that is interactive, anything from a pure interpreter to an incremental compiler, but not a full a compiler. What I am referring to as an Interactive Compiler is one that its internal language is in the form that is easily converted back into the original source lines (though not necessary exactly as typed) and in a form that the computer can run efficiently without doing an time wasting interpretation (which is done during the creation of the internal code and not during run time as in an interpreter).
Subscribe to:
Posts (Atom)