Interactive BASIC Compiler Project: June 2010

Wednesday, June 30, 2010

Translation – Expression Type

The Translator needs to keep track of the expression type as an expression is translated. The expression type is similar to the data type, but not the same. There are two types of expressions, Numeric (the double and integer data types are interchangeable in expressions) and String (including the sub-string data type). The issue was discovered when working on the INPUT PROMPT string expression, but the problem applies to all expressions. Consider these examples that will currently report an error at an inappropriate token:

Z = A$ + B$ + C$
^-- expected double

Z$ = A + B * C
^-- expected string

These errors will be confusing. The errors should instead point to the A$ variable in the first case and to the A variable in the second, reporting “expected numeric expression” and “expected string expression” respectively. Using just the word double in the message might imply that an integer expression is unacceptable, which is not the case.

The expression type may need to be set to a generic Any expression type, as would be needed for the PRINT command's expressions or the arguments of non-internal functions. For internal functions, the expression type can be set to the correct value for each argument (using the Numeric expression type if the argument is a double or an integer).

For arrays, the subscripts need to be integers (Numeric), but remember that the Translator does not know if a token with parentheses is an array or a user function. So the expression type will have to be set to the Any expression type and the Encoder will have to do the checking once it is known whether the token is an array or user function.
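The relationship between data types and expression types can be sketched as follows. This is only an illustration in Python, not the project's actual code; the names DataType, ExprType, EXPR_TYPE_OF and check_operand are hypothetical.

```python
from enum import Enum, auto


class DataType(Enum):
    DOUBLE = auto()
    INTEGER = auto()
    STRING = auto()
    SUB_STRING = auto()


class ExprType(Enum):
    ANY = auto()      # PRINT expressions, arguments of non-internal functions
    NUMERIC = auto()  # double and integer are interchangeable
    STRING = auto()   # includes the sub-string data type


# Hypothetical mapping: several data types collapse into one expression type.
EXPR_TYPE_OF = {
    DataType.DOUBLE: ExprType.NUMERIC,
    DataType.INTEGER: ExprType.NUMERIC,
    DataType.STRING: ExprType.STRING,
    DataType.SUB_STRING: ExprType.STRING,
}


def check_operand(expected, data_type):
    """Return None if the operand is acceptable, else the error message."""
    if expected is ExprType.ANY or EXPR_TYPE_OF[data_type] is expected:
        return None
    return f"expected {expected.name.lower()} expression"
```

With this arrangement the message names the expression type rather than a specific data type, so an integer operand never triggers a misleading “expected double”.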

Tuesday, June 29, 2010

Translation – INPUT and Error Reporting

Upon designing the actions required by the add input codes routine, I realized that it may need to point to a different token when an error is detected. Consider this example:

INPUT PROMPT A;B

The error will be “expected string expression” and should be pointing to the A, but the current token being processed is the semicolon token. Therefore, instead of passing the token code of the token being processed, a reference to the token pointer will be passed so that it can be changed to point to the actual token with the error.

For the INPUT command handler (called at end of statement tokens), a Colon code was going to be passed regardless of the actual end of statement token, but now a reference to a token pointer will be passed. This will be whichever end of statement token was passed to the INPUT command handler, which will pass it on to the add input codes routine.

This led to another problem. Consider this example:

INPUT PROMPT A*B+C;D

The + operator will be on top of the done stack after the expression is processed, so the error will be pointing to + since it will have the double data type. The error should be pointing to the A indicating that a string expression was expected. This problem applies to other statements as well, so some sort of expression type detection is needed...

Monday, June 28, 2010

Translator – INPUT Command (Design)

As entries were being written for INPUT with Semicolons, INPUT with Commas, and INPUT Command Handler, there was a lot of copying and pasting of the same text for what each will need to do for translation. In considering how to reduce the number of words, it became clear that this repetition indicates the code will be the same or very similar, so the common code should be put into a single routine.

The plan was already to have an add input code routine like the add print code routine. But since there is more common code for the INPUT and INPUT PROMPT commands, the functionality of this add input codes (plural since multiple codes are involved) routine will be expanded. This routine will need to know the command being translated (INPUT or INPUT PROMPT) along with the current command flags and will need to know which token is being processed (Semicolon, Comma, or an end of statement token).

The first argument then will be the command item from the command stack, which contains the current command code, command flags and the command's token. From the Semicolon and Comma token handlers, this will be the top command item on the command stack. From the INPUT command handler, the command item has already been popped from the command stack and is passed as an argument. This is why the add input codes routine can't just use the top of the command stack.

The second argument will just need to be the code for the token being processed. The Semicolon and Comma token handlers just need to pass their own code. The INPUT command handler will pass the Colon code, which will be interpreted as the end of statement. (Noting that the INPUT command handler could be called from any end of statement token like EOL, Colon, ELSE, and ENDIF; so using the Colon code is appropriate.)

Sunday, June 27, 2010

Translator – New Token Modes

Translating INPUT statements will require additional token modes. The current token modes are:

Command – Translator is expecting a command token (or start of an assignment)
Assignment – Translator is expecting an item for an assignment statement
EqualAssignment – An equal token was received when the mode was Command or Assignment; another equal token would indicate a multiple assignment statement (commas are not permitted)
CommaAssignment – A comma token was received when the mode was Command or Assignment; another comma token would indicate continuation of a multiple assignment statement (an equal token would indicate the end of the list and the beginning of the expression)
Expression – Translator is expecting operands of operators depending on the current state

When a semicolon appears at the end of an INPUT statement, no further tokens should be received except for an end of statement token (EOL, colon, ELSE, and ENDIF). A new mode is required so the Translator can make sure no additional non end of statement tokens are received:

EOS – Translator is expecting an end of statement token only

An INPUT statement contains variable(s) that are to be input. Expressions are not allowed (except for the string expression after the PROMPT keyword, or within subscripts of array variables). The INPUT translation could be implemented to check if the token on top of the done stack has the reference flag set, and if not, report an “expected variable” error. But that could lead to strange errors being reported. Consider this example (with the translation of the expression):

INPUT A*B+C A B * C +

The + will be on top of the done stack after this expression is translated (being the result of the translated expression). The + token will not have the reference flag set since it is an operator. The INPUT is expecting a reference, so it would report “expected variable” pointing to the + token. This would be very confusing – why would a variable be expected at the +? The correct error should be “expecting comma or end of statement” pointing to the * token. A new mode is required so the Translator can make sure no operators (except for end of expression operators comma, semicolon and EOL) are received:

Reference – Translator is only expecting reference tokens and end of expression operators

Reference tokens include tokens without parentheses and tokens with parentheses. However, these types of tokens could be variables, arrays or functions, and the Translator is not able to determine which. Therefore, the Encoder could still find errors if a user function was placed in an INPUT statement. Lastly, sub-string functions (while valid in assignment statements) are not valid in INPUT statements, so the Reference mode needs to check for these and report an error.
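A minimal sketch of how the two new modes might filter tokens, written in Python for illustration. The Kind classification and the check_token function are hypothetical, as is the “expected end of statement” message and the choice of “expected variable” for sub-string functions; only the “expecting comma or end of statement” message comes from the discussion above.

```python
from enum import Enum, auto


class Mode(Enum):
    EXPRESSION = auto()
    EOS = auto()        # only an end of statement token is acceptable
    REFERENCE = auto()  # only reference tokens and end of expression operators


class Kind(Enum):
    IDENTIFIER = auto()           # token without parentheses
    IDENTIFIER_PAREN = auto()     # token with parentheses (array or function)
    SUB_STRING_FUNCTION = auto()
    OPERATOR = auto()
    SEPARATOR = auto()            # comma or semicolon
    END_OF_STATEMENT = auto()     # EOL, colon, ELSE, ENDIF


ALLOWED_IN_REFERENCE = {
    Kind.IDENTIFIER,
    Kind.IDENTIFIER_PAREN,
    Kind.SEPARATOR,
    Kind.END_OF_STATEMENT,
}


def check_token(mode, kind):
    """Return None if the token is acceptable in this mode, else a message."""
    if mode is Mode.EOS:
        if kind is Kind.END_OF_STATEMENT:
            return None
        return "expected end of statement"  # hypothetical message text
    if mode is Mode.REFERENCE:
        if kind in ALLOWED_IN_REFERENCE:
            return None
        if kind is Kind.SUB_STRING_FUNCTION:
            return "expected variable"  # assumed message for this case
        return "expecting comma or end of statement"
    return None  # the other modes are not modelled in this sketch
```

For INPUT A*B+C, the * operator arrives while the mode is Reference, so the error is reported at the * token instead of at the + left on the done stack.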

Translator – INPUT command

Similar to the translation of the PRINT statement, a lot of the translation work of the INPUT statement will take place in the semicolon and comma token handlers, with the INPUT and INPUT PROMPT command handlers being called at the end of the statement. The codes have been renamed for consistency and clarity: the InputGet code will now be InputBegin, and InputPromptStr and InputPromptTmp will now be InputBeginStr and InputBeginTmp (the Tmp versions are not handled by the Translator).

There will be two command flags to keep track of the INPUT statement translation. The first is the InputBegin command flag, which indicates whether an InputBegin has been appended to the output yet. The semicolon and comma token handlers will use this flag to determine if the InputBegin code has been appended (INPUT) and whether a prompt string expression result is expected (INPUT PROMPT).

The semicolon will also use the InputBegin command flag to determine if it is at the end of the statement, which determines whether to set the second command flag, InputKeep. The InputKeep flag will be used by the command handlers at the end of the statement to determine whether the InputKeep sub-code flag should be set in the Input and InputPrompt codes at the end of the translated statement.

Saturday, June 26, 2010

Translator – Error Handling (Preliminary Release)

I decided to try using CVS branching for development of new releases. This way, code that works but is not ready to release can be committed to the CVS repository while development continues, and the differences made can then be easily checked. Until now, good files were copied to a different file name, which is fine for one file, but becomes a pain when a whole set of files is involved. Branching accounts for the strange CVS revision numbers in the source files, but this should clear up once an official release is made.

Since the known error reporting issues have been resolved, this is a good time to make a preliminary release before the changes to implement the INPUT command begin. There are now many test inputs for testing errors (there should probably be more). The file ibcp_0.1.14-dev-1-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now implementation of the INPUT command can begin...

Translator – Error Handling (Corrected)

Between working on the development of the INPUT command this week, some time was spent on working out the remaining error reporting issues. Three major areas needed additional checks to report the desired errors:

Internal functions not inside expressions, except for functions that return a sub-string (which can be used in assignments) – this was an issue because the Translator would allow internal functions in assignments, but only the sub-string functions are allowed.
At the end of an expression if the mode is assignment, checks were added to see if the Translator was inside an array or sub-string function – this was an issue because the Translator was in expression mode for the subscripts/arguments, but the errors were related to the assignment, therefore the count stack was checked to determine the current situation.
There were several places where checks were needed to determine if the done stack was empty to report the desired error – this included the end of expression in comma assignment mode, comma in assignment mode, semicolon in assignment mode, and LET command with no variable after keyword.

One more check was required at the end of a statement (currently just end-of-line, but will be required for colon when that is implemented). If an EOL was received with the mode set to comma assignment (meaning more items in the list are expected, since the mode is set to expression once an equal token is received), then an “expected equal or comma for assignment” error is reported. A new end statement table entry flag was added for this.

The giant switch statement for the error messages in the test source file was not desirable. These eventually need to be placed elsewhere. Putting them in an array is the logical choice, but how to index the array? One way is to make sure the entries are in the same order as the token status enumeration, but this requires care to make sure it is correct, and when it is not, the problem is difficult to debug. So a structure was added that contains the token status value and the string. During initialization, an index conversion array is set up and validated to make sure no entries are duplicated or missing, just like with the codes enumeration values in the table entries.
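The index conversion array idea can be sketched like this. It is a Python illustration under assumptions, not the project's code: the TokenStatus members, the message strings and the build_index and message_for functions are all hypothetical.

```python
from enum import IntEnum


class TokenStatus(IntEnum):
    GOOD = 0
    EXPECTED_OPERAND = 1
    EXPECTED_VARIABLE = 2


# Entries pair a status with its message and may appear in any order.
MESSAGES = [
    (TokenStatus.EXPECTED_VARIABLE, "expected variable"),
    (TokenStatus.GOOD, "good"),
    (TokenStatus.EXPECTED_OPERAND, "expected operand"),
]


def build_index(entries):
    """Build the index conversion array and collect validation errors."""
    index = [None] * len(TokenStatus)
    errors = []
    for position, (status, _message) in enumerate(entries):
        if index[status] is None:
            index[status] = position
        else:
            errors.append(("duplicate", status))
    for status in TokenStatus:
        if index[status] is None:
            errors.append(("missing", status))
    return index, errors


def message_for(status, entries, index):
    """Look up a message through the index conversion array."""
    return entries[index[status]][1]
```

Because each entry carries its own status value, the order of the entries no longer matters, and a forgotten or repeated entry is caught at initialization rather than showing up later as a wrong message.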

Since the error mechanism is almost identical to that used with table errors, instead of cloning this code (a common, but bad, practice of many programmers), an error template structure was implemented to be used for the token status message errors and table errors (the token status message errors only require the duplicate and missing error types). The routine to print the errors in the error list is also defined in the template in the include file. Putting calls to the fprintf() standard library function in this function is a bad idea, so a pointer to a generic print function is passed to the error reporting template function. For now the print function passed just outputs to the standard error stream.
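Passing the print function in, rather than calling the output routine directly, might look like the following Python sketch. The report_errors and print_to_stderr names and the message format are assumptions made for the illustration.

```python
import sys


def report_errors(errors, print_function):
    """Format each (kind, name) error and hand it to the supplied printer."""
    for kind, name in errors:
        print_function(f"Error: {kind} entry for {name}\n")
    return len(errors)


def print_to_stderr(text):
    """Default printer: write to the standard error stream."""
    sys.stderr.write(text)
```

The reporting routine stays independent of where the text goes, so a caller can pass print_to_stderr today and a different printer (a log file, a GUI message list) later without touching the routine.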

Friday, June 25, 2010

Execution – INPUT command (Revised)

Both the INPUT and INPUT PROMPT commands will share routines at run-time. Normally it will not be desirable to call extra functions or check sub-code flags at run-time as this uses execution time (there is overhead calling functions in C/C++). It is better if these decisions can be made beforehand. But since this is the INPUT command, execution time will not be critical. Therefore, there will be a function for outputting the “? ” (optionally based on a flag argument) and reading the input from the keyboard.

Instead of just pushing the location of the input get code to the evaluation stack (so that it can be re-executed upon an error), a flag for whether there is a prompt expression and a flag for whether the prompt is a temporary string will be pushed alongside the location. In other words, there will be an input stack item structure containing the location, prompt flag and temporary flag. So when the Input or InputPrompt code is executed, it will have access to whether there was a prompt string (that needs to be popped) and whether it is a temporary string (that needs to be deleted).

But is it necessary to know there is a prompt, since Input will know that there is no prompt and InputPrompt will know there is a prompt (though it still needs to know if the prompt is a temporary string)? Both Input and InputPrompt will need to check for an EOL in the input buffer, process the assignments on the stack, pop the location entry from the stack, and check if the cursor should be kept on the same line.

Since the temporary flag needs to be carried from the InputPromptStr and InputPromptTmp codes (why not also carry the prompt flag), and both Input and InputPrompt share a lot of the same work, both will share the same run-time routine. This is versus having three routines (one for Input, one for InputPrompt, and one for the common work). The single run-time routine will check for a prompt (which will be popped from the stack) and whether it is temporary (in which case it will be deleted).

The InputGet code will push the location with no prompt flag, and then call the get input routine with the question flag argument set. The InputPromptStr will push the location with the prompt flag set and no temporary flag, and then call the get input routine with the question flag argument set if the Question sub-code flag is set. The InputPromptTmp will push the location with the prompt flag and temporary flag set, and then call the get input routine with the question flag argument set if the Question sub-code flag is set. Now back to the implementation of the INPUT statement translation...
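The input stack item and the three begin codes described above can be sketched as follows. This is a Python illustration only; the function names, the use of a plain list as the evaluation stack, and the injected write and read_line callables are assumptions.

```python
from dataclasses import dataclass


@dataclass
class InputStackItem:
    location: int    # where to resume if the input must be redone
    prompt: bool     # a prompt string is on the stack and must be popped
    temporary: bool  # the prompt string is temporary and must be deleted


def get_input(question, write, read_line):
    """Shared routine: optionally output "? ", then read a line of input."""
    if question:
        write("? ")
    return read_line()


def input_get(location, stack, write, read_line):
    """InputGet: no prompt, always outputs the question mark."""
    stack.append(InputStackItem(location, prompt=False, temporary=False))
    return get_input(True, write, read_line)


def input_prompt(location, stack, write, read_line, question, temporary):
    """InputPromptStr (temporary=False) or InputPromptTmp (temporary=True).

    The question argument stands for the Question sub-code flag.
    """
    stack.append(InputStackItem(location, prompt=True, temporary=temporary))
    return get_input(question, write, read_line)
```

The later Input or InputPrompt code only has to inspect the item it pops to know whether a prompt string remains on the stack and whether that string needs to be deleted.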

Thursday, June 24, 2010

Translation – INPUT command (Revised)

Now with the updated syntax to the INPUT statement with the addition of the optional PROMPT keyword, the translation needs to be revised. Since the PROMPT keyword will always follow the INPUT keyword, it will be easiest to handle it as the double keyword command INPUT PROMPT, in other words, it will be a separate command. The translations of these two statements will now be:

InputGet {<variable> InputType}... Input
<expr> InputPromptStr {<variable> InputType}... InputPrompt

This mirrors other commands where the command token is placed at the end of the translation. The InputGet code does not need any sub-codes as it will always output “? ” before getting input. The InputPromptStr will have a Question sub-code to determine if a “? ” should be output after the prompt string.

There will also be an InputPromptTmp code for when the prompt string expression is a temporary string, but this determination won't occur until the Encoder can determine if the string is temporary or not. This also mirrors how temporary strings are handled for other codes. If the Tmp sub-code flag was used, this would be a different scheme and would complicate the Encoder implementation of temporary strings. There is a way to avoid the Input or InputPrompt code needing a Tmp sub-code flag (these are the codes where a temporary prompt string will be deleted at run-time).

Both the Input and InputPrompt codes will still have the Keep sub-code for when the cursor should not be advanced to the next line. Here are revised versions of the examples shown previously, with their new translations:

INPUT A InputGet A<ref> InputDbl Input
INPUT PROMPT "Qty: ",A% "Qty: " InputPromptStr A%<ref> InputInt InputPrompt
INPUT PROMPT P$+C$,A$; P$ C$ +$ InputPromptTmp A$<ref> InputStr InputPrompt'Keep'
INPUT PROMPT "",A,B% "" InputPromptStr A<ref> InputDbl B%<ref> InputInt InputPrompt

Note there is no equivalent to the last INPUT example command that disabled the prompt – the INPUT PROMPT will have to be used with an empty string as shown above. This should not be an issue since prompt-less INPUT statements will not be common. Also notice that there are three codes that will be outputting a prompt and getting the input. Next, how execution is affected by these changes...

Wednesday, June 23, 2010

Language – INPUT command (Change)

As the translation of the INPUT command was being designed, a serious flaw was uncovered with the current syntax. This can be shown with an example. Say there is a user function named Prompt$ that takes a single argument, in other words FUNCTION Prompt$(Index). The intention is to use this function as the prompt string expression in an INPUT statement, which looks like this (using the comma separator to suppress the addition of the default question mark prompt):

INPUT Prompt$(I),Array(I)

To the Translator, the Prompt$ is a token that has parentheses, which could be either an array or a user function. Requiring parentheses to be inserted around Prompt$(I) to force it to be detected as a prompt expression instead of a variable is not acceptable. The translation of the above is different depending on whether it is an array or a user function:

Array: INPUT'Question' I Prompt$(<ref> InputStr I Array(<ref> InputDbl Input
Function: I Prompt$( INPUT'Prompt+Question' I Array(<ref> InputDbl Input'Tmp'

These two translations are radically different, so one can't be assumed and then changed later by the Encoder when it has determined whether it is an array or a function. The Translator needs to know whether it is a prompt or not. If it knows that it is a prompt, then it does not need to know whether it is an array or a function (the translations are the same). For a semicolon, this is not an issue because the item before the semicolon is a prompt (unless it is at the end of the line).

To resolve this issue, there are two alternatives. Either the comma separator is eliminated (where a semicolon only indicates a prompt and suppresses the “? ”, leaving no way to add the “? ” except to actually add it to the prompt, which is not desirable), or the ANSI BASIC syntax of the PROMPT keyword is used to indicate that a prompt string expression follows. The latter will be used and the functionality of the comma (no “? ”) and semicolon (add “? ”) will remain.

Using the PROMPT keyword is the more logical choice because it clearly shows that there is a prompt, and this keeps with the spirit of a beginner's language (which is probably why it is part of ANSI BASIC). Therefore the syntax of the INPUT statement will now be:

INPUT [PROMPT <string-expression> {,|;}] <variable>[,<variable> ...][;]

Next, how this will affect the resulting translation and execution of the INPUT statement...

Tuesday, June 22, 2010

Execution – INPUT command (More Details)

The characters in the input buffer are parsed into tokens separated by commas. If there is a double quote at the beginning of a token, then a string constant is scanned for a matching closing double quote. The surrounding double quotes will not be included in the string value. Two double quotes within the string constant are treated as a single double quote. If the token begins with a numeric constant character, then an attempt is made to see if the entire token is a valid double or integer constant (otherwise it is considered a string constant). After each value token is parsed, the status flag is set to Comma or EOL depending on what terminated the value.
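The scanning rules above can be sketched as a small Python function. This is an illustration under assumptions, not the planned implementation: the names parse_value and classify are hypothetical, Python's int() and float() stand in for the real constant checks, whitespace is not trimmed, and raising an error for a missing closing quote or for text after a closing quote is a guess.

```python
NUMERIC_START = "0123456789+-."


def classify(text):
    """Decide whether an unquoted token is an integer, double or string."""
    if text and text[0] in NUMERIC_START:
        try:
            return "integer", int(text)
        except ValueError:
            pass
        try:
            return "double", float(text)
        except ValueError:
            pass
    return "string", text


def parse_value(buffer, pos):
    """Scan one value starting at pos.

    Returns (kind, value, status, next_pos), with status "Comma" or "EOL".
    """
    if pos < len(buffer) and buffer[pos] == '"':
        chars = []
        pos += 1
        while True:
            if pos >= len(buffer):
                raise ValueError("missing closing double quote")
            if buffer[pos] == '"':
                if pos + 1 < len(buffer) and buffer[pos + 1] == '"':
                    chars.append('"')  # two double quotes give one
                    pos += 2
                    continue
                pos += 1  # skip the closing double quote
                break
            chars.append(buffer[pos])
            pos += 1
        if pos < len(buffer) and buffer[pos] != ",":
            raise ValueError("expected comma or end of line after string")
        kind, value = "string", "".join(chars)
    else:
        end = buffer.find(",", pos)
        if end == -1:
            end = len(buffer)
        kind, value = classify(buffer[pos:end])
        pos = end
    if pos < len(buffer):  # stopped at a comma
        return kind, value, "Comma", pos + 1
    return kind, value, "EOL", pos
```

Calling parse_value repeatedly, each time with the returned next_pos, walks through the buffer one value at a time until the status comes back as EOL.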

The InputDbl code will check for a double or integer value, otherwise an error occurs. An integer constant is converted to a double. The InputInt code will check for a double or integer value, otherwise an error occurs. A double constant is converted to an integer if possible, otherwise an error occurs. An InputStr code will accept any value type. For double and integer values, the characters entered are used. Double quotes are needed around a string value that contains a comma.
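The three conversions can be sketched as follows, again in Python and only as an illustration. The function names and the InputError exception are hypothetical. In particular, “converted to an integer if possible” is taken here to mean that the rounded value fits in a signed 32-bit integer, which is an assumption the entry does not spell out.

```python
import math

INT_MIN, INT_MAX = -2**31, 2**31 - 1  # assumed integer range


class InputError(Exception):
    """Raised when a value cannot be used for the variable's data type."""


def input_dbl(kind, value):
    """InputDbl: accept a double or integer value; integers become doubles."""
    if kind not in ("integer", "double"):
        raise InputError("expected numeric value")
    return float(value)


def input_int(kind, value):
    """InputInt: accept an integer, or a double that converts to one."""
    if kind not in ("integer", "double"):
        raise InputError("expected numeric value")
    if kind == "double":
        if not math.isfinite(value):
            raise InputError("value out of range")
        value = round(value)  # rounding is an assumption of this sketch
    if not INT_MIN <= value <= INT_MAX:
        raise InputError("value out of range")
    return value


def input_str(kind, value, text):
    """InputStr: accept anything; numeric values keep the characters typed."""
    return value if kind == "string" else text
```

Here kind and value are the result of scanning one value from the input buffer, and text is the characters as the user entered them.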

When an error occurs, everything pushed to the evaluation stack needs to be popped up to the prompt (if there is one) and execution needs to repeat the INPUT statement. The reason the location of the InputGet code is pushed to the stack is so that execution can resume there upon an error. Any string prompt on the stack is not popped so that it can be reused.

The code the error occurs at (an InputType or Input) and the input counter value will determine how many entries need to be popped from the stack. For an InputType code, an error would occur after the variable reference is pushed to the stack but before the value and assignment function pointer. For InputType, the number of entries to pop is count*3+1. For Input, the number of entries is count*3. Once these entries are popped, the location of the InputGet is popped. The string “Redo from start” is output and execution continues at the location of the InputGet.
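The pop counts can be checked with a small Python sketch. It is an illustration only: the evaluation stack is a plain list, the stack entries are placeholder values, and the names entries_to_pop and recover are hypothetical. The count is assumed to be the number of values already fully pushed (reference, value and assign function pointer).

```python
def entries_to_pop(failing_code, count):
    """Number of stack entries above the InputGet location to discard."""
    if failing_code == "InputType":
        # count complete triples plus the reference just pushed
        return count * 3 + 1
    if failing_code == "Input":
        return count * 3
    raise ValueError(f"unexpected code: {failing_code}")


def recover(stack, failing_code, count, write):
    """Unwind the stack after an input error and return the resume location."""
    del stack[len(stack) - entries_to_pop(failing_code, count):]
    location = stack.pop()  # the location of the InputGet code
    write("Redo from start\n")
    return location  # any prompt string stays on the stack for reuse
```

For example, if the second InputType code fails after one value has been accepted, the stack holds one complete triple plus the second variable's reference, so four entries are removed before the location is popped.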

Notice that at no time during the execution of the INPUT statement, including error processing, is it required to know what type of value is at any stack entry – it already knows. Next, onto the translation of the INPUT statement...

Monday, June 21, 2010

Execution – INPUT command

To prove that the selected translation is appropriate, consider what occurs during execution, here are some example INPUT statements along with their translations:

INPUT A InputGet'Question' A<ref> InputDbl Input
INPUT "How many: ",A% "How many: " InputGet'Prompt' A%<ref> InputInt Input
INPUT P$+C$,A$; P$ C$ +$ InputGet'Prompt' A$<ref> InputStr Input'Tmp+Keep'
INPUT,A,B% InputGet A<ref> InputDbl B%<ref> InputInt Input

If there is a prompt string expression, its result will be on top of the evaluation stack. At the InputGet code, if the Prompt sub-code flag is set, then the string on top of the stack is output. If the Question sub-code flag is set, then a “? ” is output. The location of the InputGet code is pushed to the stack (to be used if an error occurs so that the InputGet can be returned to). The characters are input into a buffer until an Enter key terminates the input. There will be a status flag variable to keep parsing status of the input (initialized to None) and a value counter variable to count the number of input values (initialized to zero).

A variable reference is pushed to the evaluation stack. At an InputType code, the status flag is checked for a None or a Comma status, otherwise an error has occurred. The input buffer is scanned for a value token. If the value is the correct type, or can be converted to the correct type, then it is pushed to the stack, otherwise an error occurs. Finally, a pointer to an assign code run-time function for the data type is pushed to the stack.

At the Input code, the status flag is checked to make sure that it is set to EOL (a Comma status indicates there is more input, which is an error). An assign function pointer is popped from the evaluation stack and then called, which assigns the value to the variable that is on the stack (both are popped by the assignment). This repeats for each value until the input value counter is decremented to zero. The InputGet code location is popped from the stack and finally the prompt string is popped from the stack. If the TmpStr sub-code flag is set, then the temporary prompt string is deleted. If the Keep sub-code flag is not set, the Print function is called to advance to the next line. Next, some more execution details...
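The work done at the Input code can be sketched in Python as follows. This is an illustration under assumptions: the evaluation stack is a plain list, variable references are represented by their names, and the has_prompt parameter and the function names are hypothetical. Deleting a temporary string and advancing the cursor are left out.

```python
def assign(stack, variables):
    """Stand-in for an assign run-time function: pops the value, then the
    variable reference below it, and performs the assignment."""
    value = stack.pop()
    name = stack.pop()
    variables[name] = value


def execute_input(stack, variables, status, count, has_prompt):
    """Assign the parsed values, then pop the location and the prompt."""
    if status != "EOL":
        raise ValueError("more input values than variables")
    for _ in range(count):
        assign_function = stack.pop()
        assign_function(stack, variables)
    stack.pop()  # location of the InputGet code
    if has_prompt:
        stack.pop()  # prompt string left on the stack for possible reuse
```

Because each value sits between its variable reference and its assign function, the loop never has to inspect what kind of entry it is popping.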

Sunday, June 20, 2010

Translation – INPUT command

Without going through the entire process of how the translation was arrived at, here is the general form of the translated INPUT statement (optional items in square brackets, repeating items in braces):

[<expr>] InputGet {<variable> InputType}... Input

The InputGet code will output an optional prompt string (evaluated by the preceding string expression and pushed to the evaluation stack) and the “? ” if selected, and then will get the input. The InputType codes will parse the input for an appropriate data type. Any errors cause a loop back to the InputGet code (where the prompt is issued again). When there are no errors, the Input code will assign the values to the variables, pop the prompt string from the evaluation stack, delete the string if it is a temporary, and advance the cursor to the next line if not suppressed by a trailing semicolon.

The InputGet code will use sub-code flags to indicate whether there is a prompt string to output and whether to output the “? ” string. The Input code will use sub-code flags to indicate if there is a temporary prompt string that needs to be deleted and whether the cursor should be kept on the same line.

Normally sub-code flags are only used by the Recreator when generating the original source, and ignored during run-time (so as not to waste execution time checking them). But, considering that this is the INPUT command, a command that is about to stop and wait for an extremely slow user, using time to check sub-code flags is not going to have any significant impact. Next, a more detailed look at how the INPUT command will be executed...

Internal Code – INPUT command (Revised)

Upon considering the resulting translation of the INPUT statement, I realized there is an efficiency that can be made. The prompt string expression only needs to be evaluated one time. The resulting string can be left on the evaluation stack until the end of the INPUT statement. This way, if the prompt is needed again due to an input error, the string expression result is available and does not need to be evaluated again.

This means that the PrintStr or PrintTmp code can not be called because both will pop the string from the evaluation stack and PrintTmp will delete a temporary string. After the values are assigned, then the prompt string can be popped from the evaluation stack and deleted if it is a temporary.

There is one other aspect of the INPUT statement that needs to be considered. There is a built-in looping required to deal with errors. Not having to re-evaluate the prompt string expression means that the loop back jump does not have to know where the string expression codes begin. Next, what the INPUT translation will look like...

Internal Code – INPUT command

Before the INPUT command can be implemented in the Translator, the translation of the INPUT statement must be determined; and before this, how the INPUT command will be executed at run-time. The steps of the INPUT command at run-time are:

1. Print the prompt string (if present)
2. Print the “? ” prompt (if selected)
3. Get the input
4. Parse a value from the input for the data type of the variable
5. Save the value and repeat for each variable
6. If there is an error with a value, or too many or too few values,
7. then print “Redo from start” and go back to the first step
8. Assign each value to its corresponding variable
9. Move the cursor to the next line (unless there was a semicolon at the end)
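The run-time steps above can be sketched as a single loop. This Python version is a simplified illustration, not the internal code design: run_input and its parameters are hypothetical, the input is split on commas without any handling of quoted strings, and Python's float(), int() and str() stand in for the per-type parsing.

```python
def run_input(variables, targets, read_line, write,
              prompt=None, question=True, keep=False):
    """Read values for targets, a list of (variable name, type) pairs
    where type is "Dbl", "Int" or "Str"."""
    converters = {"Dbl": float, "Int": int, "Str": str}
    while True:
        if prompt is not None:
            write(prompt)                      # print the prompt string
        if question:
            write("? ")                        # print the question prompt
        fields = read_line().split(",")        # get the input
        try:
            if len(fields) != len(targets):    # too many or too few values
                raise ValueError("wrong number of values")
            # parse and save each value
            values = [converters[kind](field)
                      for (_name, kind), field in zip(targets, fields)]
        except ValueError:
            write("Redo from start\n")         # error: back to the first step
            continue
        break
    for (name, _kind), value in zip(targets, values):
        variables[name] = value                # assign only after all are valid
    if not keep:
        write("\n")                            # advance to the next line
```

Note that nothing is assigned until every value has parsed successfully, which is why the values have to be saved somewhere in the meantime.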

The run-time code for the PrintStr code can be called to output the prompt string (step 1). If the prompt is a temporary string, then PrintTmp needs to be called (identical to PrintStr except that it deletes the temporary string when done). But as with all other codes that take string operands, the exact code can't be determined by the Translator.

Like with the PRINT statement, there will be separate codes for parsing the input for a particular data type - InputDbl, InputInt, and InputStr (step 4). The values can't be assigned until all the values have been parsed and are valid, so the values need to be saved. Pushing them onto the evaluation stack right after the variable's reference is the logical thing to do.

Once all the values are input, the values can be assigned to the variables (step 8). The run-time code for the Assign data type can be used since the values will be on the stack (variable reference first and then data value). If there is no semicolon at the end of the statement, the cursor needs to be advanced to the next line. The run-time code for Print can be used (step 9).

Saturday, June 19, 2010

Language – INPUT command (More Details)

Allowing a string expression for the prompt will give flexibility. However, using a single string variable for the prompt's string expression can lead to possible confusion. Consider these INPUT statements where the variable P$ is meant to contain the prompt string:

INPUT P$;A,B
INPUT P$,A,B
INPUT A$;

For the first statement, it is easy to determine that the P$ is a prompt since a semicolon is separating the prompt from the input variables. However, in the second statement, is the P$ the prompt or is it the first of the three variables to input? The latter will be assumed because otherwise there would be no way to input a string as the first of two or more variables.

The Translator will determine if the first parameter is a string expression or a string variable by checking if the reference flag of the token is set. The reference flag will not be set for an actual string expression. To override this behavior, the statement can be written as “INPUT (P$),A,B” where the parentheses will clear the reference flag of the P$ token, forcing it to be interpreted as a string expression and therefore as a prompt.

In the third statement, since no variables follow the semicolon, this statement will be interpreted as inputting a single string variable, keeping the cursor on the same line after the input is entered.
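The decision described in this entry can be sketched in Python as follows. The Operand class and the two functions are hypothetical, and the sketch covers only the three example statements plus the parentheses override; it does not model error cases.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    text: str
    reference: bool  # set for a plain variable, clear for an expression


def parenthesize(operand):
    """Wrapping an operand in parentheses clears its reference flag."""
    return Operand(f"({operand.text})", reference=False)


def first_item_is_prompt(operand, separator, at_end):
    """Decide whether the first item of an INPUT statement is the prompt.

    separator is the "," or ";" following the item; at_end is True when
    nothing follows the separator.
    """
    if separator == ";":
        # INPUT P$;A,B has a prompt, but INPUT A$; inputs A$ (keep cursor)
        return not at_end
    # With a comma, only a non-reference (an expression) is a prompt.
    return not operand.reference
```

So INPUT P$,A,B treats P$ as the first input variable, while INPUT (P$),A,B treats it as the prompt.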

I have been doing some reconfiguring/upgrading of my computer (adding a larger and faster hard drive), which didn't go as smoothly as it should have. At the same time a board was removed and the system wasn't stable until its drivers were removed (duh). And there was an error on the system partition that proved to be a great pain to correct until the correct procedure was figured out. Finally the system partition was corrected, defragmented and copied to the new hard drive. Work can now continue on this project.

Thursday, June 17, 2010

Language – INPUT command

After looking at QBasic (GW-Basic, QuickBasic and FreeBASIC have the same features on the INPUT command) and TrueBASIC (similar to ANSI Basic) for inspiration, a set of desired features for the INPUT command was selected. There's no reason to get too elaborate with the INPUT command as there is a better solution for inputting values (more on that later). The syntax chosen for the INPUT command implementation is:

INPUT [[<string-expression>] {,|;}] <variable>[,<variable> ...][;]

QBasic (and GW-Basic, etc.) supports an optional prompt string, but it must be a string constant. TrueBASIC (ANSI BASIC) allows a string expression but it is preceded by the PROMPT keyword (followed by a colon and the "? " is suppressed). ANSI BASIC has some other options in addition to PROMPT (perhaps TrueBASIC does now too, but not in the 1985 version).

A string expression will be used for the prompt so that anything can be used for the prompt, not just a string constant (which is too restrictive). Similar to QBasic, a comma separator after the prompt outputs just the value of the string expression, and a semicolon will add "? " after the prompt. If there is no string expression or separator, then the standard "? " is output. The prompt is optional, and if there is just a separator, then there will be just the "? " (semicolon) or no prompt (comma). Using just the semicolon is pointless because it is the same as not having the semicolon. Using just the comma will have no prompt (which gives errors with both QBasic and GW-Basic).
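The prompt rules above can be summarized in a few lines. This is only an illustrative sketch: the promptText function and the use of '\0' to mean "no separator" are assumptions made for the example, not part of IBCP.

```cpp
#include <cassert>
#include <string>

// Builds the text shown to the user before input is read.
// prompt:    value of the optional string expression (empty if omitted)
// separator: ',' or ';' after the prompt, or '\0' if there is no separator
std::string promptText(const std::string &prompt, char separator)
{
    assert(separator == ',' || separator == ';' || separator == '\0');
    if (separator == ',')
        return prompt;       // comma: only the string expression, no "? "
    return prompt + "? ";    // semicolon or no separator: append "? "
}
```

A lone comma therefore yields an empty prompt, while a lone semicolon gives the same "? " as having no prompt at all.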

The trailing optional semicolon will keep the cursor on the same line after the input is entered. QBasic (and the others) accomplish this with the optional semicolon directly after the INPUT keyword. This is weird and it makes more sense to put the optional semicolon at the end just like with the PRINT statement.

Note: Debugging will resume on the error reporting issues, but I have little time available during this week and wanted to get started with the background for the INPUT command.

Tuesday, June 15, 2010

Translator – PRINT Command (Release)

The original plan was to put all the command test inputs into one set, but the set grew quite large when the PRINT tests were added and especially when all the error test inputs were also added. Therefore, new separate sets were created for the PRINT tests and the error tests. The lone PRINT statement at the end of the LET command tests remains (though it no longer causes a “BUG: not yet implemented” error).

One last thing, it was previously mentioned that the standard C libraries being used with GCC include a command history of the inputs entered in the interactive test input mode, accessed by the up and down arrows. I accidentally discovered a few more features using the function keys (found when F7 was entered in the MSYS window instead of the VIDE2 window where it is the build program shortcut key). These features are:

F1 – Copy one character from previous input at current position
F2 – Pops up a window asking for a character, then copies characters from current position from previous input up to that character
F3 – Copies all characters from previous input at current position
F4 – Pops up a window asking for a character to delete to, then deletes up to the character
F5 – Does the same thing as up-arrow
F6 – Puts Control-Z into buffer (not sure what this is for, possibly Control-Z is an end of file marker from the old DOS days)
F7 – Pop up a window showing all previously entered inputs (each numbered), which can then be selected
F8 – Display previous input (same as up-arrow except rotates around, but only when cursor is at the beginning of the line)
F9 – Pop up window asking for a Command Number

The Page Up and Page Down keys go to the first and last inputs saved. I also noticed that it seems to remember the inputs even between runs; I'm not sure where it stores them between terminating the program and starting it back up.

Anyway, the Translator now has support for the PRINT command and PRINT functions and ibcp_0.1.13-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. A lot of changes were required to get the PRINT command implemented; the implementation of the next command (INPUT) should be easier. Now to get the rest of the error handling working correctly...

Translator – PRINT Working, But...

This release seems to be living up to its unlucky number. I'm going to just put it out shortly and move on. The PRINT command and its error tests are working properly, so there's no reason to do an interim developmental release. However, not all the error messages are correct. But first, a few changes need to be described.

A while back, an AssignList flag was added to table entries to identify that the code is an assign list code. This flag was used when processing an operator to identify it as an assign list operator. A temporary flag is set if the operator is an assign list operator before the find code function was called (which only handles the last value in the assignment list and the expression). Upon returning, if this temporary flag is set, then the rest of the assignment list operands are processed.

It turns out this table entry flag is not necessary. Since the temporary flag is set before the find code function (which may change the operator code depending on the data types), the temporary flag can simply be set if the operator code is AssignList.

One other change was made related to list assignments. Previously the AssignList operator was not pushed to the hold stack until the equal token was received; the comma tokens were just deleted. This was changed to push an AssignList Operator token to the hold stack when the first comma token is received (the comma token was changed). This occurs when the mode is changed from Assignment to CommaAssignment (on the first comma token, subsequent comma tokens are still deleted). When the equal token is received, the mode is still changed from CommaAssignment to Expression (but the equal token is now just deleted). This change was necessary to aid in identifying which specific error message is needed when an error occurs (which is still not working correctly).
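The mode changes for list assignments can be pictured with a small sketch. The struct and function names here are hypothetical, tokens are reduced to strings, and pushing an "Assign" operator for the single-assignment case is an assumption made only to complete the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Mode { Assignment, CommaAssignment, Expression };

struct Translator {
    Mode mode;
    std::vector<std::string> holdStack;
};

// Comma received while still on the left side of an assignment.
// Returns false if the comma is not part of an assignment list.
bool assignmentComma(Translator &t)
{
    if (t.mode == Mode::Assignment) {
        t.holdStack.push_back("AssignList");  // first comma becomes the operator
        t.mode = Mode::CommaAssignment;
        return true;
    }
    return t.mode == Mode::CommaAssignment;   // later commas are just deleted
}

// Equal received: start of the expression on the right side.
// Returns false if the equal is not ending an assignment target list.
bool assignmentEqual(Translator &t)
{
    if (t.mode == Mode::Assignment)
        t.holdStack.push_back("Assign");      // assumed single-assignment case
    else if (t.mode != Mode::CommaAssignment)
        return false;                         // equal is an operator elsewhere
    t.mode = Mode::Expression;                // list case: equal is just deleted
    return true;
}
```

For a statement such as A,B,C=5 this leaves exactly one AssignList operator on the hold stack by the time the expression starts.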

Monday, June 14, 2010

Translator – PRINT Command (Still Testing)

Wow, as the remaining errors were corrected, I thought of more errors to try, and of course they did not work either and needed to be corrected. Right now there are three more errors that are not being reported correctly (and none of these were known before this evening). All of the previously known error reporting issues had been corrected plus a few more that were discovered. The various error test inputs were put into their own group, which has grown to 79 different tests (though there may be a few duplicates). Hopefully one more day...

Sunday, June 13, 2010

Translator – PRINT Command (Testing)

Sorry for the lack of posts but it has been quite an effort trying to get the desired error messages for the different errors. The simple approach would be to simply report “Syntax Error” like GW-Basic does for errors, but that defeats the idea of a Beginner's programming language – it should be helpful. After seeing how well QBasic identifies errors (though not perfect), the decision was made to do an even better job than QBasic.

There has been a little backtracking on the scheme that all error codes (and messages) should be unique. There was starting to be a lot of similar messages except for subtle differences like a “; not semicolon” or “; not comma” at the end. The decision was made that this was unnecessary, since the location in the code could be identified by the token being pointed to (comma, semicolon, or EOL), so there's really no point to having multiple unique errors for these.

It would require too much space to try to explain all the details of the changes, but here is a summary of the major changes:

Print-only functions are checked if invalid immediately upon receipt instead of waiting until the function is processed at the closing parentheses (without this, a statement like A=TAB(B$) would report that the argument is invalid, possibly implying that the TAB is okay in the assignment).
The counter stack was modified to hold both the operand counter it previously held and a new number of expected arguments value that is set for internal functions. This value is used to determine if another argument is expected or no more arguments are expected for a given internal function.
The comma and semicolon token handlers were reorganized for better error reporting.
The expression end check function was modified to check if a comma or closing parentheses is expected for internal functions (complicated by functions having different number of arguments).
Many checks of the current mode were required in the different locations in the Translator code so the correct error could be reported.

There are still a few errors that are not being reported as desired. A little more work is required to clean these up and a release can finally be made...

Thursday, June 10, 2010

Translator – PRINT Command (Testing)

A statement like “Z=A+” was reporting the “BUG: expected operand on done stack” error because the add operator (emptied at the EOL) consumed two operands from the done stack, the A and then the Z. The EOL empties the assignment operator next, which is expecting two operands, but there was only one, the result of the add operator, thus the bug error.

To fix this problem, a new FirstOperand state was needed. This state is set at the beginning. After the first operand, the Operand state is used (except after an assignment operator, comma in a PRINT statement, and semicolon in a PRINT statement). When an operator that has the end expression flag set (comma, semicolon and EOL) is received in operand mode, an error occurs only if the state is not FirstOperand (meaning at least one operand has already been received). The error message was changed from “operand expected” to “expected expression” since more than just an operand is expected (unary operator, parenthetical expression, etc.).
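A simplified model of the FirstOperand state may help here. It is a sketch only: tokens are reduced to three kinds, unary operators and parentheses are left out, and all names are invented for the example.

```cpp
#include <cassert>
#include <vector>

enum class Kind { Operand, BinaryOp, End };   // End: comma, semicolon or EOL
enum class State { FirstOperand, Operand, Operator };
enum class Result { Good, ExpectedExpression, ExpectedOperator };

// Scans token kinds and reports the first error found, if any.
Result scan(const std::vector<Kind> &tokens)
{
    State state = State::FirstOperand;
    for (Kind kind : tokens) {
        bool wantOperand = state != State::Operator;
        switch (kind) {
        case Kind::Operand:
            if (!wantOperand)
                return Result::ExpectedOperator;
            state = State::Operator;
            break;
        case Kind::BinaryOp:
            if (wantOperand)
                return Result::ExpectedExpression;
            state = State::Operand;        // an operand must follow an operator
            break;
        case Kind::End:
            if (state == State::Operand)   // e.g. "A+" followed by EOL
                return Result::ExpectedExpression;
            state = State::FirstOperand;   // empty expressions are allowed
            break;
        }
    }
    return Result::Good;
}
```

The point of the extra state is visible in the End case: an end token is an error after a dangling operator, but not when no operand has been seen at all.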

Upon further comparison with QBasic's error checking behavior, it became evident that the current error messages could be more helpful in explaining what is expected. Therefore many of the error messages were updated and some new messages were added. In keeping with the current scheme, all error codes (and messages) are unique – in other words, each occurs in only one location. This makes it much easier to identify what part of the code generated the error.

Many more test inputs were added with all kinds of errors to see if the error messages produced were sufficient and pointing to the proper token. The same statements were tried with QBasic. In some cases, I feel the QBasic message is insufficient. There was even one error statement “A B = 1” that QBasic did not catch as an error until the program was run (oops). There are still more adjustments to the error messages required (including two bug errors that are still occurring and two that are not pointing at the correct token). Hopefully this will be cleaned up shortly and the next release, with the PRINT command, can be made...

Wednesday, June 9, 2010

Translator – PRINT Command (Still Debugging)

And speaking of QBasic, I was curious how it handles invalid commas in function arguments compared to IBCP. QBasic detects the wrong number of arguments at the first comma that shouldn't be there (the “expected )” error occurs pointing to the comma) or at the closing parentheses when there should be another argument (the “expected ,” error occurs pointing to the closing parentheses). IBCP doesn't catch the error until the closing parentheses (and points to the function name with the “wrong number of arguments” error). The QBasic behavior is better.

This behavior was not difficult to replicate. The count stack (that keeps track of operands, i.e. commas) was modified to hold a count stack item, which contains the current number of operands (commas) counter and a new expected number of arguments that is set if the token is an internal function. The comma token handler then checks if too many arguments have been entered if the expected number of arguments is set.

To make this work with internal functions that accept different numbers of arguments (MID$, INSTR and ASC), the entries in the table needed to be rearranged so that the code with the larger number of arguments comes first. This way the number of expected operands is set high enough not to have an error reported before getting to the closing parentheses.
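The count stack change might look roughly like the sketch below. The CountItem fields, the Status values and the onComma function are illustrative names, not the IBCP source; an expected value of zero is assumed to mean "not an internal function".

```cpp
#include <cassert>
#include <stack>

struct CountItem {
    int operands;   // operands seen so far in this array or function
    int expected;   // expected arguments (largest form), 0 if unknown
};

enum class Status { Good, ExpectedCloseParen };

// Called for a comma inside an array or function.
Status onComma(std::stack<CountItem> &counts)
{
    assert(!counts.empty());
    CountItem &top = counts.top();
    if (top.expected != 0 && top.operands == top.expected)
        return Status::ExpectedCloseParen;   // too many arguments
    ++top.operands;                          // another operand will follow
    return Status::Good;
}
```

Because the table entry with the most arguments comes first, expected holds the largest allowed count, so a function with several forms is not rejected at a comma that is valid for its longer form.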

One more bug needs to be fixed, a statement like “Z=A+” reports the “BUG: expected operand on done stack” error. Bug errors are programming errors that should not occur. The statement “Z=(A+” does correctly report the “operand or closing parentheses expected” error. This appears to be the final bug...

QBasic (Another Incremental Compiler)

I discovered that QBasic 1.1 is available as a free download. Version 1.0 came with DOS 5.0 as a replacement for GW-Basic and with Windows 95, NT 3.x and NT 4.0. Version 1.1 came with DOS 6.x, Windows 95, 98, and Me. QBasic was based on QuickBasic 4.5. I was fascinated with the features of QuickBasic 4.5 when it came out, and the incremental compiler aspect was intriguing, but I did not think they took the user interface concepts far enough (one of the intentions of IBCP).

After getting familiar with QuickBasic 4.5, I never used it much afterward since its compiled programs were much slower than when compiled with QuickBasic 3.0 (not an incremental compiler) for the project I was involved with at the time, and I continued to use GW-Basic for development and the QuickBasic 3.0 compiler for production.

QBasic does not contain any of the compiler or linker elements of QuickBasic 4.5; however, the incremental compiler remains. The lines are compiled as they are entered and errors are reported immediately. QBasic is good inspiration for IBCP and will be used for comparison, both for features and run-time (though possibly unfair since QBasic is not 32-bit). QBasic will be mentioned frequently from here on (along with GW-Basic, QuickBasic 3.0 and FreeBASIC), though there is no intention of making IBCP a clone of QBasic (see FreeBASIC for that, which is just a compiler).

Tuesday, June 8, 2010

Translator – PRINT Command (More Debugging)

A dummy semicolon token was being appended after a print-only function, which is not necessary. A new semicolon sub-code flag was added for this case along with a new print function command flag, which is set when a print-only function is appended to the output. If the semicolon token handler sees this command flag set, it will set the semicolon sub-code flag of the print-only function token (the last token appended to the output). This command flag is cleared by the semicolon and comma token handlers to prevent further semicolons from seeing it set.

This change was also necessary for print-only functions at the end of a statement, because there was no translated difference between a print-only function at the end or one followed by a semicolon. To reproduce the ending semicolon, the semicolon sub-code flag is set when there is a semicolon after the print-only function at the end of the statement (the semicolon is not necessary since the print-only functions at the end keep the cursor on the same line).

The check for the end expression flag (set for comma, semicolon and EOL tokens) was put at the beginning of the state is operand or operator check to skip this section when the flag is set. The check was necessary because otherwise an “operand expected” was occurring when a comma or semicolon followed another comma or semicolon, or one of these was at the end of the statement. In this case, the Translator was expecting an operand when the comma, semicolon or EOL was received.

After several failed attempts to get this right, the check was removed to see what error came up for the different test input errors that have been added to test all the possible cases. In all cases, the “operand expected” occurred. This pointed to the location where the end expression flag needed to be.

The location for the check was in the operand section between the “is not an operator” and the “is a unary operator” sections (remember that unary operators are accepted when the Translator is expecting an operand). Making the comma, semicolon and EOL codes unary operators to get them past this section did not work (they were processed incorrectly later in the code), so a separate check was added.

For the end expression token check, a new “operand or closing parentheses expected” error was added (neither “operand expected” nor “missing closing parentheses” made sense for all cases where this error can occur). The error only occurs if the counter stack is not empty (inside a parenthetical expression or inside an array or function), otherwise the error will be caught by the respective token handlers. Debugging continues...

Sunday, June 6, 2010

Translator – PRINT Command (Debugging)

The code that handles checking the done stack and appending the necessary data type specific code was also put into a new add print code function so that it can be called from the comma token handler, semicolon token handler and new print command handler. Once the changes were implemented and made to compile, testing was started with the new regression test script.

The first problem encountered was that the end of the line for the expression tests reported that the done stack was not empty, which was true because the expression only mode leaves the result on the done stack. The code was modified to save the expression mode from the start function, and the end-of-line token handler now checks for the expression only mode and makes sure there is one item on the done stack.

The next problem encountered was that the single PRINT command (the last test of the current test inputs) reported an “expected operand” at the end of the line. This occurred because after the PRINT command token was received, the mode was set to Expression and the state was set to Operand. The EOL token received next was considered an operator and therefore caused the error.

To correct this, a new EndExpr table entry flag was added and set in the EOL code table entry. This flag is also needed for the Comma and SemiColon codes and will be used for other codes that end expressions (for example, Colon, THEN, ELSE, END IF, etc.). If a token is received that has this flag set, then the normal operand/operator sections are skipped. The stack emptying section follows next where these codes will empty most of the operators from the hold stack. All of these codes will have token handlers.
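As an illustration of a table entry flag like EndExpr, here is a minimal sketch. The table layout, code names and helper function are assumptions for the example and are much simpler than the real IBCP table.

```cpp
#include <cassert>

enum Code { Add_Code, Comma_Code, SemiColon_Code, EOL_Code, Code_Count };

enum TableFlag : unsigned {
    Null_Flag    = 0,
    EndExpr_Flag = 1u << 0   // code ends an expression
};

struct TableEntry {
    Code code;
    const char *name;
    unsigned flags;
};

// Entries are stored in Code order so the code can be used as the index.
const TableEntry table[Code_Count] = {
    { Add_Code,       "+",   Null_Flag    },
    { Comma_Code,     ",",   EndExpr_Flag },
    { SemiColon_Code, ";",   EndExpr_Flag },
    { EOL_Code,       "EOL", EndExpr_Flag },
};

// True if the normal operand/operator sections should be skipped.
bool endsExpression(Code code)
{
    assert(code >= 0 && code < Code_Count);
    return (table[code].flags & EndExpr_Flag) != 0;
}
```

Adding another expression-ending code later (Colon, THEN, ELSE) would then only require setting the flag in its table entry.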

Once all the existing test inputs worked, several PRINT test inputs were added. The next issue discovered was that both the comma and semicolon token handlers must switch the Translator state back to Operand from Operator (an “expected operator” error was occurring). There are many other bugs in the PRINT statement code; debugging continues...

Saturday, June 5, 2010

Translator – Command Handlers

The command handlers will be called at the end of the statement, which the end-of-line token handler is currently performing. This will change when the end-of-statement is implemented (soon with the colon operator). For now, the end-of-line token handler will call the command handler for the command on top of the command stack.

The command handler functions are friends of the Translator class, like the token handler functions; this is necessary so that pointers to these functions can be put into the table entries. The interface to these functions is similar to the token handlers except a pointer to the command stack item is passed for the second argument instead of a reference to a token pointer. This gives the command handler access to the code and flag information in the command stack item in addition to the token pointer of the command. If an error occurs and a different token needs to be pointed to for the error, the token pointer can be changed (the original token needs to be deleted first).

Command handlers will now be in charge of checking if an expression has been ended correctly and will make sure the done stack has been emptied. The end-of-line handler is responsible for:
Checking if the command stack is not empty (otherwise a “command stack empty” bug error occurs).
Checking if the command on top of the command stack has a command handler (otherwise a “not yet implemented” bug error occurs).
Popping the command item from command stack and calling its command handler.
Popping the initial null token from the hold stack.
Checking if the done stack is empty.
Checking if the command stack is empty.
Deleting the EOL token and returning a done status.
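The steps in this list could be sketched as follows. This is an illustrative model rather than the actual handler: stacks hold strings instead of tokens, the status names are invented, and deleting the EOL token is omitted because there are no real tokens here.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Status { Good, Done, BugCmdStackEmpty, BugNotYetImplemented,
                    BugDoneStackNotEmpty, BugCmdStackNotEmpty };

struct Translator;
struct CmdItem;
using CmdHandler = Status (*)(Translator &, CmdItem *);

struct CmdItem {
    std::string name;
    CmdHandler handler;   // nullptr if the command is not implemented yet
};

struct Translator {
    std::vector<CmdItem> cmdStack;
    std::vector<std::string> holdStack{"Null"};   // initial null token
    std::vector<std::string> doneStack;
};

// Assignment handler that does nothing more than return a good status.
Status assignHandler(Translator &, CmdItem *) { return Status::Good; }

Status endOfLine(Translator &t)
{
    if (t.cmdStack.empty())
        return Status::BugCmdStackEmpty;
    if (t.cmdStack.back().handler == nullptr)
        return Status::BugNotYetImplemented;
    CmdItem item = t.cmdStack.back();        // pop the command item
    t.cmdStack.pop_back();
    Status status = item.handler(t, &item);  // handler ends the expression
    if (status != Status::Good)
        return status;
    assert(!t.holdStack.empty());
    t.holdStack.pop_back();                  // pop the initial null token
    if (!t.doneStack.empty())
        return Status::BugDoneStackNotEmpty;
    if (!t.cmdStack.empty())
        return Status::BugCmdStackNotEmpty;
    return Status::Done;
}
```

Passing the command item by pointer mirrors the interface described above, where the handler receives the command stack item as its second argument.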

Some of this functionality will be moved to the end-of-statement routine and the end-of-line token handler will be changed once multiple statements per line are implemented. There also needs to be an assignment command handler that does nothing more than return a good status. Now on to implementing the PRINT command handler function...

Friday, June 4, 2010

Translator – End of Statement Processing

The end-of-line processing currently performs the check if an expression was ended correctly, that is, checks if the hold stack contains only the initial null token (meaning all operators have been processed). The comma and semicolon should empty all operators from the hold stack, except lower precedence opening parentheses, functions, and array tokens. Except for the null token which acts as a buffer, if any of these are still on the stack, there is a missing closing parentheses. Finally, any pending parentheses need to be processed.

This end of expression code needs to be called by the comma and semicolon token handlers, so it will be moved into its own function (duplicating code is a bad programming practice). This function will also be called by future token handlers. In other words, it needs to be called at the end of a statement. The end-of-line is also the end of a statement, which currently contains this code.
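A shared end-of-expression function might be sketched like this. It is a simplification under stated assumptions: the hold stack holds a hypothetical HoldItem, remaining operators are emptied inside the function rather than by precedence beforehand, and pending parentheses handling is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { Null, Operator, OpenParen, Function, Array };
enum class Status { Good, MissingCloseParen };

struct HoldItem {
    Kind kind;
    std::string text;
};

// Empties remaining operators to the output, then checks that only the
// initial null token is left on the hold stack.
Status endExpression(std::vector<HoldItem> &holdStack,
                     std::vector<std::string> &output)
{
    assert(!holdStack.empty() && holdStack.front().kind == Kind::Null);
    while (holdStack.back().kind == Kind::Operator) {
        output.push_back(holdStack.back().text);
        holdStack.pop_back();
    }
    // anything other than the null token is an unclosed paren/function/array
    return holdStack.back().kind == Kind::Null
        ? Status::Good : Status::MissingCloseParen;
}
```

Having one function means the comma, semicolon and command handlers all report a missing closing parentheses in the same way.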

However, the end of expression should not be called by the end-of-line token handler, which should instead be calling the command handler for the command that is currently being processed (on top of the command stack). The command handler will then do the end of expression call if required (some commands don't have expressions). The end-of-line token handler will pop the command from on top of the command stack and call the command's command handler.

Currently, when an assignment operator is appended to the output, the pointer to the output item is pushed to the done stack. There is no reason to push this to the done stack because it is not needed there (it is just popped at the end of the statement). Items are normally pushed onto the done stack so that the output item can be referenced by another token (for example, operands of functions for checking data types), which does not apply to the assignment operator.

Remember the LET command was popped from the command stack (if there) by the assignment operator. The assignment operator should instead be pushed onto the command stack since it is a command. It will become clearer why once the compound commands are implemented (like IF-THEN-ELSE). The assignment, being a command, will not get popped from the command stack (like all commands). The assignment command doesn't need a command handler, so it will have a NULL command handler function pointer.

Thursday, June 3, 2010

Translator – Commas with PRINT

Currently a comma is not valid within an expression if the comma is not inside an array or function. The comma has a low precedence, below all operators, but above end-of-line, close parentheses and commands, so it will empty all other operators from the hold stack. No lower precedence tokens will be on the hold stack except for an open parentheses, identifier with parentheses (function or array), function (internal or defined), or an assignment operator.

The comma token handler will be updated to check for a PRINT command on top of the command stack. Any other command will still cause an “unexpected comma in expression” error to occur. The hold stack needs to be checked to make sure it is empty (has the initial Null token on top), otherwise a “missing closing parentheses” error occurs. This check is the same one that is performed in the semicolon token handler and also in the end-of-line token handler, so this code will be put into its own function for all to call. An assignment won't be on the hold stack if there is a PRINT on the command stack.

For a PRINT command, the handler will process any expression on top of the done stack. The expression on the done stack will be handled by the find code function as previously described. The flag for the PRINT command item on top of the command stack needs to be set in case this is the last comma in the PRINT statement, to tell the PRINT command handler not to append the Print code to the output (to keep the cursor on the same line).

If the done stack is empty, then there was no expression before the comma (allowed), which means to skip to the next column. The comma token will be appended to the output. For now, the PRINT command is the only reason a comma token is put into the translated output, therefore the execution routine called during run-time for the comma will be to advance to the next column. There is no need for a PrintComma code unless there will be another reason to put a comma in the program.
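Putting the comma rules for PRINT together gives something like the sketch below. It is not the IBCP handler: tokens are plain strings, "PrintExpr" is a placeholder for whatever data type specific print code the find code function would select, and the stayOnLine field stands in for the flag in the PRINT command item.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Status { Good, UnexpectedComma, MissingCloseParen };

struct CmdItem {
    std::string name;
    bool stayOnLine;   // set when the final Print code should be suppressed
};

struct Translator {
    std::vector<CmdItem> cmdStack;
    std::vector<std::string> holdStack{"Null"};   // initial null token
    std::vector<std::string> doneStack;
    std::vector<std::string> output;
};

// Comma received outside of any array or function.
Status commaHandler(Translator &t)
{
    if (t.cmdStack.empty() || t.cmdStack.back().name != "PRINT")
        return Status::UnexpectedComma;
    assert(!t.holdStack.empty());
    if (t.holdStack.back() != "Null")
        return Status::MissingCloseParen;
    if (!t.doneStack.empty()) {            // expression before the comma
        t.doneStack.pop_back();
        t.output.push_back("PrintExpr");   // placeholder print code
    }
    t.output.push_back(",");               // advance to next column at run-time
    t.cmdStack.back().stayOnLine = true;   // in case this comma is last
    return Status::Good;
}
```

For PRINT A,B the output after the comma would hold the operand A, its print code and the comma, with the flag set in case nothing else follows.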

Wednesday, June 2, 2010

Translator – Semicolons with PRINT

Up to now, the Semicolon (an operator) has not been implemented (in fact, causes the program to crash if entered in an expression because its table entry has no expression information structure). The semicolon will have a low precedence, the same as the comma since it is also an expression separator, so it will empty all other operators from the hold stack. No lower precedence tokens will be on the hold stack except for an open parentheses, identifier with parentheses (function or array), function (internal or defined), or an assignment operator.

The Semicolon will have a token handler function, which is called after the hold stack is processed. The handler will check the command on top of the command stack. If the command stack is empty, then an “unexpected semicolon” error will occur. For now, only a PRINT command needs to be processed. Any other command will also cause an “unexpected semicolon” error to occur. The hold stack needs to be checked to make sure it is empty (has the initial Null token on top), otherwise a “missing closing parentheses” error occurs. An assignment won't be on the hold stack if there is a PRINT on the command stack.

For a PRINT command, the handler needs to process any expression on top of the done stack. If the done stack is empty, then there was no expression before the semicolon (allowed). In this case, a dummy semicolon token is appended to the output so that it can be reproduced. Adding a sub-code for this situation is a wasted effort (because there may be no token before the semicolon in the output, which would complicate matters in using a sub-code).

The expression on the done stack will be handled by the find code function as previously described. The flag for the PRINT command item on top of the command stack needs to be set in case this is the last semicolon in the PRINT statement – to tell the PRINT command not to append the Print code to the output (to keep the cursor on the same line).

The PrintTmpStr (which will be named PrintTmp to stick with the 3 character data type naming convention for code names) is not being added at this time. PrintStr is a code that takes a string operand, and as such, whether its operand is a String or a TmpStr can't be determined by the Translator, so it will have its operand saved for later determination by the Encoder.

Tuesday, June 1, 2010

Translator – Print-Only Functions

There are currently two print-only functions, SPC and TAB. Both functions may only be used in a PRINT statement and are not valid in any other expression. There needs to be a new table entry Print flag that will be used for these print-only functions, which are defined as internal functions.

When internal functions are processed (arguments checked and necessary hidden conversion codes inserted), a check is needed: if a function has the Print flag, then the current command on top of the command stack must be a PRINT command, otherwise an “invalid use of print function” error occurs. In addition, print-only functions do not have return values, so no result value is pushed onto the done stack. These print-only functions also need to set the flag in the PRINT command item on top of the command stack in case one of them is the last item in the PRINT statement (to prevent the PRINT code from being appended to the output, which will keep the cursor on the same line at run-time).

During execution at run-time, these functions perform their action directly to the output, which is the reason no result needs to be pushed back on the evaluation stack (the result is outputted instead).
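The print-only function check could be sketched as shown here. The types and the processInternalFunction name are hypothetical, argument checking is omitted, and the done stack simply records function names in place of output item pointers.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Status { Good, InvalidPrintFunction };

struct CmdItem {
    std::string name;
    bool stayOnLine;   // suppresses the final PRINT code when set
};

struct Function {
    std::string name;
    bool printOnly;    // the Print table entry flag (SPC and TAB)
};

struct Translator {
    std::vector<CmdItem> cmdStack;
    std::vector<std::string> doneStack;
};

Status processInternalFunction(Translator &t, const Function &fn)
{
    if (!fn.printOnly) {
        t.doneStack.push_back(fn.name);    // normal function: result on stack
        return Status::Good;
    }
    if (t.cmdStack.empty() || t.cmdStack.back().name != "PRINT")
        return Status::InvalidPrintFunction;
    t.cmdStack.back().stayOnLine = true;   // in case this is the last item
    return Status::Good;                   // no result pushed to done stack
}
```

So TAB(10) inside a PRINT statement leaves the done stack untouched, while the same call in an assignment is rejected.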

Bug Fix: Upon making the changes for print-only functions, I realized there was an unrelated bug where a command would be accepted after receiving an operand (for example: A LET ...) because the mode was still set to Command. Therefore, when an operand token is processed, if the mode is Command then the mode is set to Assignment – this will prevent a command token from being accepted after an operand.