Saturday, June 25, 2011

Switch in Linux Distribution

The 64-bit Kubuntu 10.10 installation has been problematic. Several comments around the Internet have indicated this to be the case. Kubuntu was chosen because it contains the KDE desktop, and 10.10 was chosen because it is the newest release on the supported host operating system list for VMware Workstation (being used for the Windows XP virtual machine). Standard Ubuntu comes with the GNOME interface, which I am not a fan of.

The problems with Kubuntu mainly include general sluggishness, a problem that was never resolved after a little over a month of use (something not expected on a Phenom II quad core running at 3.2 GHz with 4GB of memory and an NVIDIA GTS-450). It would completely hose up at times (not lock up or crash) when doing even simple tasks like uncompressing files. The problems persisted even after all the updates were applied (less the update to 11.04 Natty Narwhal), including KDE updates, and all the fancy desktop effects were turned off. Mounting a bunch of NTFS partitions (from the previous Windows XP installation) was also suspected as a cause.

Therefore, the Linux Mint 10 (Julia) with KDE distribution, which is also based on Ubuntu 10.10 (Maverick Meerkat), was chosen for the next installation. Linux Mint is another distribution based on Ubuntu/Debian, so all the various commands mentioned in previous posts for Kubuntu also apply to Mint.

However, several issues were discovered in the various required tools that were not mentioned previously. There are some additional packages required for building GCC 4.6.0 and its prerequisite libraries (the GNU autotools) and for building CMake (specifically the ccmake GUI). The Linker Problem on Linux and New Make System posts were updated with new information about these dependencies. After a day of use, Linux Mint 10 is running as expected for a modern operating system on a fairly decent (though not cutting edge) system.

Thursday, June 23, 2011

Using CMake on Windows

The latest version of CMake (2.8.4) for Windows was installed, which can be obtained from here (the Win32 Installer). This is the same version used on Kubuntu 10.10. On Windows, a CMake (cmake-gui) program is installed in the CMake 2.8 start menu group, and it took some effort to figure out how to use it. Here is the procedure for building the program, starting in cmake-gui:
  1. Select "Browse Source..." and point to where the source is
  2. Select "Browse Build..." and point to where the build will be (which should not be the source directory)
  3. Optionally there is a "Make New Directory" to create a "build" directory
  4. Select the "Configure" button. However, if the "Current Generator:" line does not say "None", first select File/Delete Cache to change it to None.
Before doing these steps, CMake needs to be able to find the GCC compilers and other tools. The Windows environment variable PATH needs to be modified, which can be done by inserting "C:\MinGW\bin;C:\MinGW\msys\1.0\bin;" at the beginning of PATH (assuming the default directories were used in the MinGW installer). Upon selecting "Configure" in Step 4, CMake will pop up a dialog. Change the generator for the project to "MSYS Makefiles" and select the "Finish" button. If all goes well, some lines will be output in the bottom section with no errors (errors are output in red text).

Now select "Generate" and the make file will be created. In an MSYS window, change to the build directory and do a "make" command. The program should now be built. All the generated files will be in the build directory including the executable, the auto-generated header files and the codes.txt file.
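For those who prefer the command line, the same configuration can presumably be done without cmake-gui from an MSYS shell (a sketch based on standard CMake usage; directory names are examples, not mandated by the posts):

```shell
# add the MinGW and MSYS tools to the path (MSYS shell syntax,
# assuming the default MinGW installation directories)
export PATH=/c/MinGW/bin:/c/MinGW/msys/1.0/bin:$PATH

# configure with the MSYS Makefiles generator, then build
mkdir ibcp-build
cd ibcp-build
cmake -G "MSYS Makefiles" ../ibcp
make
```

This produces the same makefile in the build directory that the GUI's "Generate" button would create.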

Tuesday, June 21, 2011

Using the CMake System

There are a few simple steps for building using the CMake system. It's worth noting that when using CMake, all the output files are put into a separate build directory. This prevents the source directory from getting cluttered with files. Assuming that the source files are in the ibcp sub-directory, an ibcp-build directory will be created. These commands are used for building the project:
  1. mkdir ibcp-build
  2. cd ibcp-build
  3. cmake ../ibcp
  4. make
If no errors occur in Step 3 (an error would occur, for example, if the correct version of GCC is not found), then Step 4 will build the program. All auto-generated include files, object files and the program will be located in the ibcp-build directory. To enable debug information in the executable, the ccmake program is used with these steps:
  1. ccmake ../ibcp
  2. Press Enter on the CMAKE_BUILD_TYPE line
  3. Type in "Debug" and press Enter
  4. Press [c] to configure
  5. Press [g] to generate and exit
  6. Now entering make will build with debug information
Setting the string to "Release" in Step 3 will turn off the debugging information. There is also a "make clean" option. As long as the build is not performed in the source directory (possible but not recommended), the entire contents of the build directory can also simply be deleted and cmake rerun. Now that CMake is working on Linux, it is time to make it work under Windows...
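Alternatively, the build type can be set directly on the cmake command line, bypassing the interactive ccmake steps (a sketch based on standard CMake usage, not taken from the original posts):

```shell
# equivalent to setting CMAKE_BUILD_TYPE in ccmake
cmake -DCMAKE_BUILD_TYPE=Debug ../ibcp
make

# switch back to a build without debug information
cmake -DCMAKE_BUILD_TYPE=Release ../ibcp
make
```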

Monday, June 20, 2011

The CMake System

The CMake system consists of a CMakeLists.txt file that describes how to generate the make file that will be used for building a project. The initial CMakeLists.txt file consisted of these major sections:
  1. Check that the GCC version is 4.5 or later.
  2. Find the awk program.
  3. Pass version information to the IBCP source.
  4. Create custom commands for the auto-generated header files.
  5. Define the lists of header and source files.
  6. Add linker definitions for the static library linking.
  7. Define the executable.
Step 1 is needed to make sure the needed version of the GCC compiler is installed and set up correctly. Step 2 checks if the awk program is available and gets its path - the awk program is needed to create the auto-generated header files. Step 3 passes program version and copyright information to the source program. This is accomplished by an ibcp_config.h.in file with definitions set to CMake variables, which is used to create an ibcp_config.h file that will be included in the source.

Step 4 contains two custom commands to run awk to create the two auto-generated header files. The awk scripts were modified to take a command line argument for the directory containing their input files, since the awk scripts will be run in the build directory, not the source directory. This path argument was made optional; if not provided, the awk scripts assume the current directory. In the CMakeLists.txt custom commands, the source directory path is passed to the awk scripts. The auto-generated header files are now written into the build directory (CMake calls this the binary directory), not the source directory, so a statement was also needed in CMakeLists.txt to add the binary directory to the list of paths that are searched for header files.

In Step 5, the lists of header and source files are put into CMake variables. These variables are used to define the dependencies and to build the executable. Step 6 adds the ‑static‑libgcc and ‑static‑libstdc++ linker options. Finally, Step 7 defines the executable, which depends on the auto-generated header files, the project header files and the source files. The ".exe" is not needed on the executable program name as CMake will automatically add it for Windows, but not for Linux (CMake is aware of which system is being used for building).
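A minimal sketch of how these sections might look in a CMakeLists.txt (the version check method, script names, and file names here are illustrative assumptions, not the project's actual file):

```cmake
cmake_minimum_required(VERSION 2.8)
project(ibcp)

# 2: find the awk program needed for the auto-generated headers
find_program(AWK awk)
if(NOT AWK)
    message(FATAL_ERROR "awk is required to build ibcp")
endif()

# 3: pass version information to the source via a configured header
configure_file(${PROJECT_SOURCE_DIR}/ibcp_config.h.in
               ${PROJECT_BINARY_DIR}/ibcp_config.h)

# 4: custom command for an auto-generated header; the source
# directory is passed so the script can find its input files
add_custom_command(OUTPUT autoenums.h
    COMMAND ${AWK} -f ${PROJECT_SOURCE_DIR}/enums.awk ${PROJECT_SOURCE_DIR}/
    DEPENDS ${PROJECT_SOURCE_DIR}/enums.awk)

# headers are generated into the build (binary) directory
include_directories(${PROJECT_BINARY_DIR})

# 6: static linking of the GCC runtime libraries
set(CMAKE_EXE_LINKER_FLAGS "-static-libgcc -static-libstdc++")

# 5 and 7: source lists and the executable (file names made up)
set(SOURCES ibcp.cpp parser.cpp translator.cpp autoenums.h)
add_executable(ibcp ${SOURCES})
```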

Sunday, June 19, 2011

New Make System

The standalone make file was originally generated by a feature in NetBeans, and was heavily modified to make it work, but it never looked very clean. Some consideration was given to rewriting it using proper make rules (having an individual build rule for each individual source file is not the way make is supposed to be utilized).

Continuing to use the make system within NetBeans was not desirable, though it does allow the setup to deal with multiple platforms (Linux and Windows). One reason for not using it is that releasing all the associated files is not really an ideal option. It also never really worked right for the include files auto-generated by the awk scripts - warnings were generated for every build and no solution for eliminating these warnings was ever found (perhaps this was a bug in 6.9.1 that may have been fixed in 7.0).

Using the GNU autotools was out of the question - way too complicated and too much time to come up to speed with them. Instead the CMake system (cross-platform make generator) will be used as it is far simpler to set up and use. Plus NetBeans is supposed to understand CMake files. Adding CMake on Kubuntu installed version 2.8.2; however, this version did not contain the ccmake user interface for configuring the build. So the Linux source for version 2.8.4 (the latest) was downloaded (from here) and installed.

There are two prerequisite packages required to build the text-based ccmake GUI (more on this later): the libncurses5 and libncurses5-dev packages. These can be installed with this command:
sudo apt-get install <package-name>

Upgrade to NetBeans 7.0

NetBeans 7.0 was recently released (version 6.9.1 was being used for development on Windows), which contains support for git via a plug-in. Getting that to work took some research and experimentation. It appeared that the git plug-in only allowed creating a new git repository, not opening an existing one. However, it was discovered that if a new project is created from existing source and a standalone make file in a directory with a git repository, NetBeans will then see the repository.

There was a problem building the project. The compiler complained that it didn't recognize the -static-libstdc++ option. NetBeans was still executing GCC 4.4.5, even though it was correctly pointing to the GCC 4.6.0 directory. Setting the path to include GCC 4.6.0 before executing NetBeans from the command line did not help.

Next an attempt was made to run the debugger on the project. There was no debug option in the standalone make file. Previously, the project was set up under NetBeans, and it generated its own complex and convoluted set of files for building the project. Both a Release and a Debug configuration were created by NetBeans. To release the project, NetBeans was used to create a standalone make file instead of releasing all the NetBeans files, which would have required NetBeans to build.

This time around, NetBeans was given this (heavily modified) make file when the project was created, though there was no Debug configuration. The make file was temporarily modified to turn on debug compiling. The make system needs to be reworked to allow both a release and debug configuration without using the system built into NetBeans.

Saturday, June 18, 2011

Linker Problem on Linux

Now that the program builds and runs on Windows, it was time to make sure it still worked on Linux. However, the linker failed because the compiler/linker did not understand the ‑static‑libstdc++ option. Upon research, it was discovered that this option is only in GCC version 4.5 and later, so it was not available in the GCC 4.4.5 on Kubuntu.

The latest version of GCC (4.6.0) was downloaded and installed on Kubuntu from the source code. This required quite a bit of work to accomplish (for complete details, hit the continue link). Now that the project was building on both Linux and Windows, it was time to introduce it to the NetBeans IDE (which was now at version 7.0; version 6.9.1 was being used for development on Windows).

Friday, June 17, 2011

Running on Windows

There were no problems building the modified project back on Windows, however, when running the regression test script on Windows, all tests failed. This occurred because the expected test results files were in Unix (LF only) format, but the freshly generated output files were in DOS (CRLF) format.  The files otherwise appeared to be identical.

To verify that the output was the same, the test script was modified where the cmp ‑s command (‑s for silent) was replaced with a diff ‑bq command (‑b for ignore white-space, ‑q for quiet) to have it ignore the differences between CRLF and LF, both considered white-space. Now all tests compared to the expected results. A better solution will need to be found to deal with the differences between Windows and Linux.

One other problem was discovered when trying to run the program on Windows. Upon the first run, an error was reported that the application failed to start because libstdc++‑6.dll was not found. Once the environment variables were adjusted to include the MinGW binary path, the program ran fine.

This was the same problem as before with libgcc_s_dw2‑1.dll. That problem was resolved by adding a -static-libgcc option to the linker step, which links in the libgcc.a library. It turns out that there is a similar option, ‑static‑libstdc++, for linking in the libstdc++.a library. The program now ran without requiring any external library. This was tested by moving the program to another computer that did not have MinGW installed.

Thursday, June 16, 2011

Setting up and Building on Windows

Now that the code was building and working on Linux, it was time to set up the Windows XP virtual machine for building. On the previous Windows XP installation, MinGW (with GCC 3.4.6) and MSYS (1.0.11) were installed, with a number of packages installed later (like GCC 4.4.0). It was a very hacked together setup in an attempt to make things work right.

This time around, the Automated MinGW Installer was used. The current 2011-05-30 version was used, which contains the latest versions of MSYS, MinGW and all the other utilities.  The latest version can be obtained here. When installing this package (which actually downloads all the latest versions before installation), both the MSYS and MinGW options were selected. The version of GCC installed was 4.5.2.

Moving the project software back over to Windows, the build succeeded. Turns out the awk scripts still work in Windows. There was also no warning about the gets() function. This may be because the version implemented with MinGW contains more features, which were mentioned back on June 15, 2010 and are not implemented in the Linux version.

Wednesday, June 15, 2011

Integers on 64-bit Linux

After a successful compile, the regression test script aborted with errors. The errors occurred because it was trying to execute ibcp but the make file was building ibcp.exe. This worked on Windows because Windows will execute ibcp.exe when the command is ibcp. The script was modified to run ibcp.exe instead (this issue will be resolved shortly). Now only two of the parser tests and one of the translator tests failed.

The first parser test (immediate commands) was failing on inputs containing large values for line numbers or increments that were supposed to be reported as errors, but were now being accepted, and the value output was not the same as the input. The problem in the third parser test (numbers) was that a negative integer one beyond the limit of an integer was now also being accepted as an integer (it should have become a double since it was out of range for an integer). And the problem in the 17th translator test (negative values) was that another negative number out of range of an integer was also being accepted.

These problems were caused by the strtol() function calls used to convert strings to an integer. An overflow was expected for these large numbers. This function returns a long integer, and on 32-bit Windows (or any 32-bit OS) a long is the same size as an integer (32 bits), so an overflow was reported. However, on 64-bit Linux (or any 64-bit OS), long integers are 64 bits, so these large values were now being accepted, but when converted to an integer, were being truncated from 64 bits to 32 bits.

To correct this problem, in addition to the ERANGE (overflow) error condition, a condition was added to check if the value is beyond the maximum integer using the predefined constant INT_MAX (and INT_MIN when the value could also be negative).
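A sketch of the kind of conversion check described above (the function name and structure are hypothetical, not the project's actual code):

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Convert a decimal string to a 32-bit int, reporting overflow even
// on platforms where long is 64 bits (hypothetical helper function).
bool toInt32(const char *str, int &value)
{
    errno = 0;
    char *end;
    long result = std::strtol(str, &end, 10);
    // ERANGE covers overflow of long itself; the INT_MAX/INT_MIN
    // checks catch values that fit in a 64-bit long but not an int.
    if (errno == ERANGE || result > INT_MAX || result < INT_MIN)
    {
        return false;  // out of range for an integer
    }
    value = static_cast<int>(result);
    return true;
}
```

With this check, "2147483648" is rejected on both 32-bit systems (via ERANGE) and 64-bit systems (via the INT_MAX comparison).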

The remaining problem in the third parser test was that only two digits were output for floating point exponents on Linux if a third (the hundreds) digit was not required. On Windows, three digits are always output for the exponent. This must be due to slightly different builds or implementation of the standard C library functions. How to deal with this problem for testing will be dealt with a little later.

Tuesday, June 14, 2011

Resolving 64-bit Linux Build Problems

There were problems with the awk scripts - they were not generating the auto-generated include files. This was caused by the RS (Record Separator) and ORS (Output Record Separator) variables being set to CRLF to work with Windows. The files were now all in Unix (LF) format, so the input source files were not being read correctly. The RS and ORS lines were therefore commented out.

Either the git configuration option autocrlf (set to false) caused git upon checkout to convert any DOS (CRLF) format files in the repository to Unix (LF) format, or the CRLF line endings were removed when the CVS repository was converted to git. There were previous problems with CRLF on Windows when switching from cvs with MSYS to CVSNT (all the files were changed to DOS format since CVSNT didn't seem to like Unix format files).

Next there were warnings about the gets() function calls in the test routines, which were not present on Windows (using GCC 4.4.0 on MinGW/MSYS). After updating with the package manager, Kubuntu had GCC 4.4.5, so perhaps this was a change between 4.4.0 and 4.4.5, or a difference between the GCC build on MinGW and the one on Kubuntu. The warning complained that the return value from gets() was being ignored. To remove the warnings, statements were added to check the return value.

The linker also warned that the gets() function is dangerous and should not be used. This is because gets() does not check for overflow of the buffer given to it. The buffer allocated is plenty big for testing here, and this is just the test code anyway, so this warning will be ignored for the time being.
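For reference, the usual replacement for gets() (not a change made in the project, which kept gets() in its test code) is fgets(), which bounds the read. A minimal wrapper might look like this (the name readLine is made up):

```cpp
#include <cstdio>
#include <cstring>

// Read one line into buffer (at most size-1 characters), stripping
// the trailing newline that fgets() keeps; returns false on EOF/error.
bool readLine(char *buffer, std::size_t size, std::FILE *stream)
{
    if (std::fgets(buffer, static_cast<int>(size), stream) == nullptr)
    {
        return false;
    }
    std::size_t len = std::strlen(buffer);
    if (len > 0 && buffer[len - 1] == '\n')
    {
        buffer[len - 1] = '\0';  // drop the newline, matching gets()
    }
    return true;
}
```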

Monday, June 13, 2011

Building on 64-bit Linux

Before attempting to build the project on Linux, the version control repository of the project code was converted from the Concurrent Versions System (CVS) to git, a more modern version control system. The version installed using Kubuntu's software package manager was 1.7.4.4. However, this version did not have the needed cvs import utility, so the latest version (1.7.6) was installed directly from the git repository (though git is required to retrieve it). Using git will have no impact on this project's software releases except that the CVS version tags will be removed from the source files.

The intention was to use the command line make file distributed with the project to build on Linux. Previously on Windows, NetBeans was used to build the project, though the command line make file (which was initially generated by an option in NetBeans) was used to test the project before a release. For the moment, NetBeans would not be used (more about this later). Some minor problems needed to be resolved before the project would build and work on Linux...

Sunday, June 12, 2011

Development Environment Change

Because of a job requirement, I had to install 64-bit Linux on my computer (I chose Kubuntu 10.10, Maverick Meerkat). So development of this project will be moved over to 64-bit Linux, at least for the time being. Since the program so far only has a simple command line interface (no GUI), this won't cause an immediate problem. The releases will still be tested under Windows XP as I have a Windows XP virtual machine running on Kubuntu. Perhaps when the time comes to start on the GUI, a common tool set will be chosen, something like Qt.

For the moment, this will also give the opportunity to present how to set up Windows to be able to build the project, since this virtual machine is a new Windows installation and not a clone of my previous Windows installation. A lot of things were originally tried, and reconstructing the procedure that finally worked in the end would probably be impossible.

Also possibly changing are the actual development tools that will be used. Under consideration are changing the version control system (from cvs to git), the IDE (from NetBeans 6.9.1 to the newer version 7.0, which has a plug-in for git, or possibly to the Eclipse CDT IDE), and the build system (from NetBeans' make system and the standalone make file currently distributed with the project to the cmake build system). Each of these will be explained in the posts that follow...

Wednesday, March 30, 2011

Translator – Colon (Design)

The colon token will work very similarly to the end-of-line token and so will have the end statement flag in its table entry. The difference from the end-of-line token is that instead of terminating the translation of a line, the token mode will be set to command for the next statement.

There will be two error conditions. First, two colons in a row will not be allowed. While a second colon would not affect anything and is allowed in other BASICs, it will be considered an error here; there is no point in allowing it. Second, a colon will not be allowed at the end of a line with no command after it.

When a colon token is received, the colon sub-code will be set in the command token on top of the command stack. The command token will be appended to the end of the statement when the command has been processed (by the command handler).

However, there is an issue with print statements. The actual print command token may not be appended to the output if there is a semicolon, comma or print function at the end of the print statement. In this case, the print command handler will have to transfer the colon sub-code to the last print code that was appended to the output, which may be a print type (when a semicolon is not needed), comma, semicolon, SPC or TAB.

Tuesday, March 29, 2011

Colon – Statement Separator

The colon is used to separate statements in the BASIC language; though it is not part of ANSI BASIC, it is part of the more common BASICs (GW-Basic, QBasic, FreeBasic, etc.). Before moving to the translation of colon tokens, like everything else, the action during run-time must be defined. However, the colon does not actually do anything at run-time.

Colons don't actually need to be stored separately in the program; however, for them to be reproduced, something needs to be put into the internal program. The assumption that there will be a colon at the end of each statement except at the end of the line is not sufficient. Consider this statement:
IF A>B THEN PRINT A ELSE PRINT B:A=B:B=0.0
There is no colon after the first print statement. To reproduce colons properly, there will be a colon sub-code set for a command token that has a colon following the statement. For the statements after the ELSE in the example above, the colon sub-code will be set as shown in this translation:
B PrintDbl'Colon' A<ref> B Assign'Colon' B<ref> 0.0 Assign
When a line is reproduced, the Recreator will add a colon after a statement that has the colon sub-code set.

Sunday, March 27, 2011

Translator – A Unary Operator Curiosity

One of the test statements created for testing the unary operator fix was:
A = -B^NOT C% + -D*NOT E%
The intention of this statement was to test the NOT unary operator in front of the second operand of both the exponentiation and multiplication operators. The translation of this statement was expected to be (the blue expression being the first operand and the red expression being the second operand of the addition):
A<ref> B C% NOT ^* Neg D Neg E% NOT *%2 + Assign
However, what was produced was this unexpected translation:
A<ref> B C% D Neg E% NOT *%2 +%1 Cvtint NOT ^* Neg Assign
Upon reviewing the precedence of the operators (see Translator – Operator Precedence) and the code, it turns out that this translation was correct. The ADD is higher precedence than the NOT, so the operands of the ADD are C% and the –D*NOT E% expression (with MUL (*%2) higher precedence than ADD, its operands are the –D and NOT E%).

So while the exponentiation has the highest precedence, NOT having a low precedence allowed the ADD to bind the rest of the expression to the NOT, which became the second operand of the exponentiation, with the first negation being the final operator. In C/C++, the not (!) and negation (-) operators are very high precedence (and there is no exponentiation operator). But here, NOT was given a low precedence just above the other logical operators but below the math operators (see Translator – Operator Precedence for the reasoning). Normally the NOT operator would probably not be used in the same expression as exponentiation like this.

Translator – Unary Operator Problem

While testing the negative constant changes, a new problem was discovered with unary operators, specifically this statement:
A = ---B
This produced a “done stack empty” bug error at the first negation token. The problem occurred because the second negation operator forced the first negation operator from the hold stack because it was of greater or equal precedence, and when checking the operand of the first negation operator, there was nothing on the done stack. Here are some additional examples:
A = B*-C
A = B^-C
The first statement translated correctly because negation is higher precedence than multiplication leaving multiplication on the hold stack. However, the second statement failed because negation is lower precedence than exponentiation forcing exponentiation from the hold stack but with only one operand on the done stack generating the done stack empty bug error. A new rule was needed for unary operators.

Basically, unary operators should not force any tokens (unary operators, binary operators, arrays, or functions) from the hold stack regardless of their precedence, because not all of their operands have been received yet (the unary operator's operand has not been fully received). Non-unary operators should still force unary operators from the hold stack if the unary operator has higher precedence.

The check to force tokens from the hold stack was changed to: the precedence of the operator on the hold stack must be higher than that of the current operator, and the current token must not be a unary operator. Unary operators will now not force other tokens from the hold stack, but other tokens will still force unary operators from the hold stack if the unary operator is higher in precedence. While testing this change, a curious result was produced from one of the test statements...
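The revised check might look something like this sketch (the Token type and all names are hypothetical; the original greater-or-equal comparison described for the old behavior is assumed to be kept, with only the unary-operator test added):

```cpp
#include <stack>

// Minimal stand-in for the Translator's operator token (hypothetical)
struct Token
{
    int precedence;
    bool isUnary;
};

// A unary operator never forces tokens from the hold stack, because
// its operand has not been fully received yet; other tokens still
// force greater-or-equal precedence operators off the stack.
bool shouldPop(const std::stack<Token> &holdStack, const Token &current)
{
    return !holdStack.empty()
        && holdStack.top().precedence >= current.precedence
        && !current.isUnary;
}
```

With this rule, A = B^-C leaves exponentiation on the hold stack when the negation arrives, avoiding the done stack empty error.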

Parser – Negative Constants

Negative constants were previously not considered by the Parser, which interpreted a minus as the subtract operator. The Translator then changed it to a negate operator when it appeared in the operand state. Consider these two examples (along with their current translations):
A = B-1.5      A B 1.5 Sub Assign
A = -1.5+B     A 1.5 Neg B Add Assign
The reason for the Parser not looking for signs on numerical constants can be seen in the first example. If the Parser produced the four tokens A = B -1.5, the Translator would generate an “expected operator” error at the -1.5 token, since a second operand token was received when it was expecting a binary operator. The second example produces an unnecessary negate token after the constant. While perfectly valid, this is not desirable.

In order for the Parser to correctly interpret negative signs on numerical constants, it needs to be aware of whether the Translator is in operand state or not. If in operand state, the Parser can look for a negative sign in front of a number constant, otherwise a minus should be interpreted as an operator.

A new operand state flag was added to the Parser with an access function to set its value (it is initialized to off). The Parser's get number routine was modified with a new sign flag used to record whether a negative sign was found first. This flag also prevents multiple negative signs. The routine will only check for a negative sign if no digits or decimal point have been seen yet and the new operand state flag is on.

An access function was added to the Translator to get the current operand state (either operand or operand-or-end state). Before calling the Parser get token routine, the Parser operand state is set from the Translator's current operand state. While testing, a problem was discovered with unary operators...
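A simplified sketch of the sign handling (restricted to integer digits for brevity; the names and structure are hypothetical, not the project's actual Parser):

```cpp
#include <cctype>
#include <string>

// Scan a numeric constant starting at pos, consuming one leading
// minus sign only when the Translator expects an operand.  Returns
// the token text, or an empty string if no constant is present.
std::string scanNumber(const std::string &input, std::size_t pos,
                       bool operandState)
{
    std::size_t start = pos;
    // a sign is only taken when an operand is expected and no digits
    // have been seen yet (we are at the start of the token here)
    if (operandState && pos < input.size() && input[pos] == '-')
    {
        ++pos;
    }
    std::size_t digitsStart = pos;
    while (pos < input.size()
        && std::isdigit(static_cast<unsigned char>(input[pos])))
    {
        ++pos;
    }
    if (pos == digitsStart)  // no digits: not a numeric constant
    {
        return std::string();
    }
    return input.substr(start, pos - start);
}
```

When the flag is off, the minus is left for the Parser to return as an operator token, exactly as in the B-1.5 example above.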

Saturday, March 26, 2011

Parser – Tokens With Parentheses

While correcting issues with define function tokens, it was noticed that it is not necessary to also store the opening parenthesis in the string field of the token. This also includes the generic tokens with parentheses. The parenthesis is not necessary because there are separate token types to identify tokens with and without parentheses.

It is also advantageous to not store the parentheses so that an array or define function name can be found in the dictionary. For define functions, take the code snippet:
DEF FNHypot(X,Y)
    FNHypot=SQR(X*X+Y*Y)
END DEF
Z = FNHypot(3,4)
Both the FNHypot( and FNHypot tokens appear, which represent the same function. If the parenthesis were stored in the dictionary for the function name, it would require complicated string comparisons to figure out that the FNHypot token is the same function. This same issue applies to regular function names.

A similar issue also applies to arrays. When functions and subroutines are implemented, there will be a feature to allow an entire array to be passed to a function or subroutine. Exactly how this will work has not been defined yet, but this code snippet shows how it would look:
DIM Array(10)
CALL subroutine(Array)
The changes for removing the parenthesis from these tokens were very simple. In the Parser's get identifier routine, when creating the string for these tokens, the length provided to the string constructor was changed to be one less than the actual length so the parenthesis would not be included. In the Translator, when reporting an error at the open parenthesis of a define-function-with-parentheses token, the minus one was removed since the length is now one less. Finally, in the token test output routines, an open parenthesis was added to the output of these tokens.

Project – Tokens Status Enumeration

Each time a new error (token status) is added, renamed or deleted, two changes were needed: both the token status enumeration (include file) and the message array (source file) needed to be changed. The changes had to be made correctly or the two files would be out of sync. To check for problems, code had been added to initialization to check for duplicates and missing entries.

Similar to the automatic generation of the code enumeration from the table entries, the token status enumeration will also be generated automatically. Each message array element was a structure containing a token status value and a pointer to a message string. The elements were changed to just the message string, with the name of the token status in a comment at the end of the line.

The codes awk script was renamed to enums and was modified to also read the token status message array to generate the token status enumeration automatically by reading the name in the comment after the message string. The name of the output file was changed from codes.h to autoenums.h to be more generic and to allow for additional automatic enumerations.
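The resulting message array convention might look like this sketch (the array name and entries here are made up for illustration):

```cpp
#include <cstring>

// Each entry is just the message string; the enums awk script reads
// the token status name from the trailing comment to regenerate the
// enumeration in autoenums.h.  The enum value equals the array index.
static const char *statusMessages[] = {
    "expected command",   // ExpCmd_TokenStatus
    "expected operator",  // ExpOp_TokenStatus
    "done stack empty",   // BUG_DoneStackEmpty
};
```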

During initialization, in addition to checking for duplicates and missing entries, a translation index array was built to translate from token status value to index. Both the checking and the translation array have been removed since they are no longer necessary: the awk script checks for duplicates, and the token status is now the same as the index into the message array.

The error type template class is no longer needed for token status errors (but is still needed for the table entry errors). The duplicate and missing errors were also no longer needed and were removed. Some problems were found in this template class for table entries where the range error was not working because the wrong constructor was being called. Instead of storing indexes to the errors, the variables were changed to be the type of the template. This made the range error constructor unique.

Friday, March 25, 2011

Translation – INPUT Command (Release)

The remaining problem was that the input command handler was deleting the token passed in (the token terminating the invalid string prompt expression), but the caller (the call command handler routine) was also deleting the token, since the command handler changed the token to point to the error token. The extra token delete was removed from the input command handler.

The INPUT command is fully working and ibcp_0.1.15-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. To support the INPUT command, several other changes were needed, including making the token codes and table entry indexes one and the same, correcting print function issues, correcting sub-string assignment issues, handling assignment token mode differently, handling define function tokens correctly, implementing an end statement Translator state, and implementing the reference token mode.

Next up, a slight change to the direction of this project to make it a little more interesting. Instead of plodding along with the translation of more commands, work will begin on the other components, including the encoder, dictionary, recreator, program (maintaining the internal program and the program editor), and the run-time module.

The goal is to get this BASIC working as there are (almost) enough commands to make a very simple BASIC program (input, assignments and output). Once this is working, more commands can be implemented. But first a couple of minor things will be implemented before proceeding with this new direction.

Thursday, March 24, 2011

Translation – INPUT Error Debugging

The main problem with the wrong tokens being reported for errors in the INPUT command was because the token with the error was not put into the command item structure passed to the INPUT command – a requirement of command handlers reporting an error. Once this was added, most of the errors were now pointing to the correct token.

A new “expected operator, semicolon or comma” error was added for when an end statement token (for example the EOL in the incomplete statement INPUT PROMPT A$) is received after a valid string expression because this is a little more accurate than just the “expected semicolon or comma” error.

The two remaining errors were “invalid mode” bug errors that were occurring in the end expression error routine that is called when an end statement token is received during operand state. Support for the reference mode needed to be added to this routine, which needed to return the “expected variable” error.

One problem remains. The error test statement INPUT PROMPT A+B*C is causing extra token deletes, which was detected by the memory leak detection mechanism that was implemented a little while ago. To be continued...

Wednesday, March 23, 2011

Translation – INPUT PROMPT Debugging

The first problem found with the INPUT PROMPT command was that reference mode was not being set after the string prompt expression was processed. The next problem was that the Translator was still in binary operator state following the comma or semicolon after the string prompt expression was processed, so the state had to be set back to operand state, which led to the discovery of some other minor state issues.

The first operand state is very similar to the operand state except that end expression tokens (like comma, semicolon, and EOL) are also acceptable (normally considered operators). This state was implemented for the PRINT command since these tokens are allowed when an operand is expected (for example the PRINT,,A statement). This state is also set after a command is received, and it is up to the command handler to decide if an immediate end expression is allowed (which it currently is only for the PRINT command).

The equal token handler was found to be incorrectly setting the first operand state in an assignment statement. This did not cause a problem because end expression operators appearing incorrectly within expressions were being caught elsewhere (by their respective token handlers). In any case, calling it the first operand state was a little confusing, so it was renamed more appropriately to the operand or end state. The equal token handler was corrected to set only the operand state.

The valid INPUT PROMPT test statements are now working, now on to the invalid INPUT statements, which for the most part are not reporting the correct token where an error is detected or just not reporting the correct error including some bug errors...

Monday, March 21, 2011

Translation – INPUT Debugging

Several minor issues were found and corrected. When the end of the INPUT command occurs, the last input parse code has to be marked with the end sub-code. Support for reference mode also needed to be added to the comma token handler.

When the EOL token was received by the INPUT command handler, the token was reused for the input assign code for the variable. Upon return from the command handler, the EOL token handler proceeded to delete the EOL token (which was no longer the EOL token). To prevent this, a check was added: if the token no longer contains an EOL code, it is assumed to have been used by the command handler and will not be deleted.
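The reuse check above can be sketched as follows (a minimal illustration with hypothetical names, not the project's actual classes): the command handler may convert the EOL token into another code, and the caller only deletes the token if it still carries the EOL code.

```cpp
#include <cassert>

enum Code { EOL_Code, InputAssign_Code };

struct Token {
    Code code;
};

// command handler reuses the EOL token for the input assign code
// rather than deleting it and creating a new token
void command_handler(Token *token)
{
    token->code = InputAssign_Code;
}

// EOL token handler: returns true if the token should still be deleted;
// if the code is no longer EOL, the command handler kept the token
bool caller_should_delete(const Token *token)
{
    return token->code == EOL_Code;
}
```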

The InputBegin code was just being appended to the output, but the first variable had already been added to the output, so it was after the variable instead of at the start of the statement. This code could be inserted at the beginning of the output list, however, this would only work if the INPUT command was at the beginning of the line, which may not be the case (multiple statements per line will be supported).

So instead, the element pointer in the command item will be set to the current last item in the output list when a command is pushed to the command stack. Since this pointer can no longer be checked for null to determine if an input begin code has been added, a new input begin command flag was added, which is set once an input begin code has been added to the output.

Another problem found was that none of the input variables had their reference flag set once added to the output. The find code routine was not checking for a reference (not important) but was clearing the reference flag (a problem) since the token being checked (the input assign code) did not have its reference flag set. The reference flag of the input assign token was set before calling the process final operand routine to make the find code routine work as desired, and then cleared upon return.

The valid INPUT test statements are now working, now on to the INPUT PROMPT statements, which are not working...

Sunday, March 20, 2011

INPUT Translation – Variable Handling and Ending

To look up the appropriate input assign code, the process final operand routine is used. This routine calls the find code routine, which checks whether the token on the done stack has its reference flag set, since the input assign codes have the reference flag (this check should not be necessary since the INPUT command uses the recently implemented reference mode). The reference variable will be popped from the done stack and the input assign code will be appended to the output before returning.

The InputAssignInt and InputAssignStr are associated codes for the InputAssign code. Once the input assign code has been appended to the output, the appropriate input parse code needs to be inserted after the input begin code or last input parse code. The easiest way to get the appropriate input parse code was to have each (InputParse, InputParseInt, and InputParseStr) be associated codes to the input assign codes. The second associated code will be used for these.

When a final semicolon token is received, the stay on line command flag is set (the same flag used for the PRINT command) and the state is set to end statement. When an end statement token is received, the INPUT command handler checks whether the stay flag is set and, if so, sets the keep sub-code of the INPUT command token, which is then appended to the output.

If an end statement token is received with no semicolon, the INPUT command token is immediately appended to the output without the keep sub-code. The INPUT command handler has now been implemented and the code compiles, so debugging can begin...
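The end-of-statement handling above reduces to a small piece of logic, sketched here with hypothetical names (the actual flag and sub-code values differ in the project): a trailing semicolon sets the stay flag, which in turn adds the keep sub-code to the INPUT command token before it is appended to the output.

```cpp
#include <cassert>

const int Keep_SubCode = 0x01;

struct Token {
    int subcode;
};

// called when the end statement token arrives: a prior final semicolon
// (stay flag) means the cursor should stay on the line, recorded by
// setting the keep sub-code on the INPUT command token
void end_input_statement(Token &inputToken, bool stayFlagSet)
{
    if (stayFlagSet)
        inputToken.subcode |= Keep_SubCode;
    // the INPUT command token is then appended to the output (not shown)
}
```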

Saturday, March 19, 2011

Translator – End Statement State

Once a semicolon is received at the end of an INPUT statement, no more tokens should be received except for an end-of-statement token. If another token is received, then an error should be reported. To accomplish this, a new end statement state similar to the end expression state is needed. While the end expression state only needs to be checked when operators are expected, the end statement state needs to be checked for all tokens.

The end statement state check was added just before the check for operand or first operand state in the main translator add token routine. When in end statement state, if the token does not have the end statement flag set in its table entry, then an “expected end-of-statement” error is reported against the token.

The access function for getting the table entry flags for a token already checks if the token has a table entry; if there is no table entry, then zero (no flags) is returned. Currently only the EOL code has the end statement flag set, but eventually the Colon, ELSE and ENDIF tokens will also, and possibly other codes.
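A minimal sketch of this check (names hypothetical): because the flags access function returns zero for tokens without a table entry, tokens with no entry fail the end statement check without any special casing.

```cpp
#include <cassert>

const int EndStmt_Flag = 0x02;

struct Token {
    bool hasTableEntry;
    int flags;   // table entry flags (meaningful only when hasTableEntry)
};

// access function: no table entry means no flags
int table_flags(const Token &token)
{
    return token.hasTableEntry ? token.flags : 0;
}

// in end statement state, only tokens with the end statement flag
// (currently just EOL) are acceptable; anything else is an
// "expected end-of-statement" error
bool end_statement_ok(const Token &token)
{
    return (table_flags(token) & EndStmt_Flag) != 0;
}
```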

Friday, March 18, 2011

Translation – Define Function Token Issues

For now in the process operand routine, define function with parentheses tokens will not be allowed in command or assignment mode. This will need to be changed later when the DEF command is implemented. This check was also made in the check assignment list item routine.

However, since a define function without a parentheses token is allowed in assignments, the error was set to point to the open parentheses as an "expected equal or comma for assignment" error. The open parentheses is at the end of the token, so to get the error to point to it, the column of the token was incremented by the length of the token minus one.
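The column adjustment above is simple arithmetic; here is a sketch (the function name is hypothetical) for a token whose last character is the open parentheses:

```cpp
#include <cassert>

// the open parentheses is the last character of a "define function
// with parentheses" token, so the error column is the token's column
// plus its length minus one
int paren_error_column(int tokenColumn, int tokenLength)
{
    return tokenColumn + tokenLength - 1;
}
```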

Previously in the close parentheses token handler, the reference flag was being set for token with parentheses and define function with parentheses tokens. This check was modified to only set the reference flag for token with parentheses.

The reference flag for define function without parentheses tokens was already being set, so no change was needed for these tokens.

Thursday, March 17, 2011

Language – Define Functions

Before defining what needs to be done with define function tokens in the Translator, a quick review of their syntax is required. There will be two forms of define functions supported, a single line form and a multiple line form. An example of a single line define function would be:
DEF FNHypot(X,Y)=SQR(X*X+Y*Y)
Notice that this form has the same format as an assignment except for the DEF command at the beginning. The define function token in this statement is FNHypot( - in other words, a define function with parentheses. This implies that a define function with parentheses token could appear in assignment mode (assuming the DEF command sets this mode), but only for the DEF command. An example of the multiple line form of the same function would be:
DEF FNHypot(X,Y)
    FNHypot=SQR(X*X+Y*Y)
END DEF
The assignment of the define function name (a define function without a parentheses) returns the value for the function. Here a define function without a parentheses can appear in an assignment statement, but only inside a DEF/END DEF block. Since the Translator is not aware of blocks, it will permit an assignment of define function without a parentheses token. It will be the Encoder's job to verify if the assignment is valid.

Wednesday, March 16, 2011

Translation – Assignment Token Issues

Finally, some issues were discovered when making the change from command mode to assignment after processing an operand token. There are two main types of operand tokens, ones with parentheses and ones without parentheses.

The operand tokens with parentheses include internal functions, defined user functions (DEF FN) and generic tokens with parentheses (which can be arrays or user functions). Internal functions were already invalid for command or assignment mode except for sub-string functions. A check was added for defined functions with parentheses.

The operand tokens with no parentheses include internal functions with no arguments (currently only RND), constants, define user functions with no arguments, and generic tokens with no parentheses (which can be variables or user functions with no arguments). Internal functions and constants were already invalid because they didn't have the reference flag set when the comma or equal token looked for the reference flag. That left define function tokens...

Tuesday, March 15, 2011

Translation – Command Token Issues

The last problem statements related to unexpected command tokens were:
A PRINT B
MID$(A$ PRINT,4)=""
The first statement gave an “expected operator or end-of-statement” error at the B token. The second statement was actually accepted, but with a strange translation. The problems were caused because when the PRINT command token was received, it was immediately pushed to the command stack because the mode was still set to command.

Both statements start as assignment statements, but assignment mode was not being set (it is only set when preceded by the LET keyword). When an equal token is received, expression mode is set, and when a comma token is received, assignment list mode is set. The “unexpected command” error was only occurring when the mode was not set to command, which was not the case for the statements above. Also, this message again does not fit with the “expected ...” type of message.

Once a command token is received in command mode, the mode is set according to the token mode in the command's table entry. A change was made to the process operand routine that once an operand token is processed, if in command mode, the operand is assumed the beginning of an assignment statement and so the mode is changed to assignment.

To remove the “unexpected command” error and report a more appropriate error, the command token has to be passed through the rest of the Translator. This will occur when the Translator is not in command mode. So the main add token routine was changed to not report this error if a command token is received and the mode is not command.

Command tokens received in operand state were already being reported correctly since commands are also considered operators, which are not valid operands (unless the operator is a unary operator, which commands are not). Commands are considered operators because some commands can be found where an operator is expected, for example, THEN and ELSE.

Command tokens received in operator state are only valid if they have a token handler. In the process operator routine, when an operator token does not have a token handler, a default operator token handler is called. Before the default operator token handler is called, a check was added to return an appropriate error if the token is a command.

Sunday, March 13, 2011

Translation – Other Errors

After changing the “item cannot be assigned” error to “expecting item for assignment”, there were several other errors that didn't fit the “expecting ...” type of message. It turned out that most of these were not actually being used, so they were removed.

Another remaining message was the “missing open parentheses” error that occurs when there is a closing parentheses with no matching open parentheses, function or array. After some consideration of possibly leaving this message as is, it was decided to change it to an “expected operator or end-of-expression” error, since the problem could also be a missing function or array, or even that the open parentheses was just a mistake.

Again assuming that everything is correct up to the problem, this change seemed appropriate, and “...expression” was used instead of “...statement” because the next token could be a comma or semicolon in a PRINT statement or a THEN in an IF statement.

The last message was the “unexpected command” error that occurs when there is a command token when not in command mode. However, there were a number of additional problems with command tokens received when not expected...

Translation – Parentheses Issue

The next problem statement was:
MID$((A$),4)=""
This was reported as an “item cannot be assigned” error and then crashed. Again, this error didn't fit the “expected ...” messages. This error is also returned for statements like 3=A and 1,A=B and was renamed to the “expecting item for assignment” error. For the statement above, the “expecting string variable” error should be returned.

The crash occurred because the open parentheses token was returned for the error with its range extended to the closing parentheses to report the entire (A$). The caller deleted the error token since it was an open parentheses to prevent a memory leak. However, in this case, the A$ token was still on top of the done stack with the open parentheses attached as the first operand. When the Translator clean up routine (called upon an error) was emptying the done stack, it deletes each item's first and last operand – the open parentheses was getting deleted twice causing the crash.

Initially, to fix this problem, when an error occurs and the first through last operand is returned, the first operand pointer for the item on the done stack was set to null to prevent it from being deleted a second time. While this fix was sufficient for the statement above, this statement was still not being reported correctly:
MID$(-A$,4)=""
The error was “expected numeric expression” pointing to the A$. Both the open parentheses and the minus are initially processed in the process unary operator routine. So a check was added to this routine to return an error if there is a sub-string function on top of the hold stack with its reference flag set (sub-string assignment) and it is the first operand. Both of these statements then correctly reported “expected string variable” at the open parentheses and minus. The initial fix was not necessary since the error was being caught sooner.

Saturday, March 12, 2011

Translation – Another Print Function Issue

While the new reference mode was being implemented, some problems were discovered in the code that carefully constructed statements would exploit giving incorrect results. The first problem statement was:
TAB(10)=A
This caused a “command stack empty for expression” bug error. This should have been an “expecting command” error. The check was: the command stack was empty, or there was a PRINT command on top of the command stack, or the top of the hold stack was not empty (except for the null token).

The error occurred when trying to get the type of expression: the hold stack was empty (the first place checked to figure out the expression type), so it went to the command stack, which was also empty. This check should not have caught this, because there was a check after it for catching all internal functions in command mode. The check was modified to: the command stack is not empty, and either there is a PRINT on top of the command stack or the hold stack is not empty.

Thursday, March 10, 2011

Translator – Reference Mode

A new token mode is needed for the INPUT commands. The current token modes are command, assignment, assignment list and expression. Consider these invalid INPUT statements:
INPUT A+B*C
INPUT A+B*C$
The INPUT commands must have variables, not expressions. If expression mode was used, at each comma and at the end of the statement, the INPUT command handler would need to check the token on top of the done stack to see if its reference flag is set. For the first statement above, the INPUT command handler will see the multiplication token on top of the stack. Its first token will be A and last token will be C. An “expecting variable” error would be reported pointing to the whole A+B*C expression. This would be acceptable.

However, in the second statement, an error occurs before the INPUT command handler gets a chance to report an error. When the multiplication token checks its second operand, it will report an “expecting numeric expression” error pointing at the C$ token. This would make no sense. The proper error for both of these statements would be an “expecting semicolon, comma or end-of-statement” error pointing at the add token.

The new reference mode will only accept variables and array elements, specifically tokens with no parentheses and tokens with parentheses. If these tokens turn out to be user functions, which the Translator cannot determine, the error will be reported by the Encoder. Reference mode will be set when the INPUT command is received and after the comma or semicolon of the prompt string expression of the INPUT PROMPT command. It will also be used later for the READ command.

In reference mode, internal functions, define functions, and unary operators will be reported as invalid (“expecting variable” error). After the variable or array element token is pushed to the done stack, the state will be set to end expression so that only end expression tokens are valid. Binary operators will then correctly be reported as invalid.
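The acceptance rule for reference mode can be sketched as a small predicate (token type names here are hypothetical stand-ins for the project's actual token types):

```cpp
#include <cassert>

// token types reference mode may encounter
enum TokenType {
    NoParen_TokenType,   // variable (or user function, which the
                         // Translator cannot distinguish)
    Paren_TokenType,     // array element (or user function)
    IntFunc_TokenType,   // internal function
    DefFunc_TokenType,   // define function
    Operator_TokenType   // includes unary operators
};

// reference mode only accepts tokens that could be variables or array
// elements; anything that turns out to be a user function is left for
// the Encoder to reject
bool reference_mode_accepts(TokenType type)
{
    return type == NoParen_TokenType || type == Paren_TokenType;
}
```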

While reference mode was being implemented, some more problems were found in the Translator code...

Sunday, March 6, 2011

Translation – PRINT Function Problem

Due to the problems found with the print codes, some additional error tests were added to translator test 10 (PRINT statement tests) including these statements:
PRINT (TAB(10))
PRINT INT(TAB(10))
PRINT A(TAB(10))
PRINT A+TAB(10)
PRINT TAB(10)+A
The first three test the situation of a print function inside parentheses, an internal function, and an array or user function; these were caught by adding a count stack is not empty check when a print function token is received.

In the fourth statement, the error can be caught by checking if the hold stack is not empty (an empty hold stack has only a null token). In this case, an operator will be on top of the hold stack. In fact, this check can replace the count stack is not empty check because the open parentheses, internal function and array/user function will be on top of the hold stack.

The last statement was more difficult to check for - the expression should end after the print function. When the operator was received and checked its operand, a bug occurred because the done stack is empty since the print function didn't get pushed to the done stack. The error should be “expected semicolon, comma or end-of-statement” and point to the operator.

To catch this error, a new end expression state was added. The closing parentheses is received when in binary operator state. After the closing parentheses, the state was left at binary operator since another binary operator is expected (an end of expression token is acceptable as a binary operator). For the end expression state, only operators with the end expression flag are acceptable, which currently include the semicolon, comma and end-of-line tokens – other operators will generate the error.

In resolving these issues, the “invalid use of print function” error did not match the other “expecting ...” errors (remember, the goal is to help the user by suggesting what is expected at the location of an error). Therefore, this error was changed to be one of the “expecting xxx expression” errors depending on the current expression type (xxx would be blank, numeric or string).

Saturday, March 5, 2011

Translation – PRINT Code Issues

The process final operand routine currently doesn't push print codes to the done stack. This needs to be expanded to include the input begin prompt codes. It would be inconvenient to keep expanding this test to include additional codes that don't need to be pushed to the done stack. Instead, the check could be changed to see if the code does not have a return data type (it is set to none). This would include the print codes, the input begin codes, the input parse type codes, the input assign type codes, and probably many more.

However, when this change was made, it did not work for the TAB and SPC print functions because the token data type for these had been incorrectly set to double. This occurred in the set default data type function when the token was received. This function was fixed to not set the data type for internal functions (these are set from their table entry data type).

Now when the token's data type is none, it won't be pushed to the done stack. The check whether the command on top of the command stack is the PRINT command only applies to print functions, but is not necessary here since this check was already made when the print function was first received, so the check was removed. The check whether the token has the print flag remains (to set the stay and print function command flags, which are used by the print command handler to determine if the final print code should be appended to the output).

For the print type codes, when the process final operand routine is called by the add print code routine, the second token passed is a null, not a closing parentheses. The process final operand routine was deleting the second token for print functions, assuming that it was a closing parentheses (which only applies to TAB and SPC). For some reason, doing a delete with a null argument was not causing a problem. Nonetheless, a check was added to only delete the second token if it is not a null.

Another problem was discovered for print functions. The situation where a TAB or SPC is contained within parentheses, an internal function, or an array/user function was not caught, since the code was only checking whether there was a print command. Therefore, an additional check was added to make sure the count stack is also empty.

Thursday, March 3, 2011

INPUT Translation – Some Issues

As the INPUT command handler was being implemented, some issues were found. When the new element pointer was added to the command item, it was noticed that there was a code member. This member is no longer needed because the token now contains the code (replacing the index member, though the code is now an index), so the code member was removed from the command item.

There was no table entry for the two word INPUT PROMPT command, so one was added along with the three input begin codes. It was noticed that some table entries still had the string flag set. This is not necessary because the string flag is now set automatically during table initialization if there are any string arguments, so these string flags were removed.

To look up which input begin code to append for the INPUT PROMPT command, the process final operand routine will be called, which in turn will call the find code routine that will pick the correct code based on the type (string or temporary string) that is on the done stack. The input begin codes will not push anything to the done stack (accomplished by setting the done push flag to false). Some additional issues were found in these routines...

Wednesday, March 2, 2011

INPUT Translation – Variables

Processing variables is a bit more complicated. With the PRINT statement, the appropriate print value type code was simply appended to the output. However, with the INPUT statement, an input parse code needs to be inserted after the begin code or after last parse code (before all of the variables), and an input assign code needs to be appended to the end of the output (after the variable).

There needs to be a way to point to the location where the input parse codes are to be inserted. This will be accomplished with a new output list element pointer member to the command stack item. This pointer will be initialized to null when a new command token is pushed to the command stack.

When an input begin code is appended to the output, this new element pointer will be set to the input begin code element. This pointer can also be used to indicate if any input variables have been received yet (when it does not contain a null).

To insert an input parse code, the input parse token will be appended to the list at (after) this element pointer. The element pointer will then be set to the output list element of the input parse token just inserted so that the next input parse token will be inserted after it.
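The insertion scheme above can be sketched using std::list as a stand-in for the output list (the project's actual list and command item classes differ; all names here are illustrative):

```cpp
#include <cassert>
#include <list>
#include <string>

typedef std::list<std::string> OutputList;

struct CommandItem {
    OutputList::iterator element;  // where input parse codes are inserted
};

// insert a parse code after the command item's element pointer, then
// advance the pointer so the next parse code lands after this one
void insert_parse_code(OutputList &output, CommandItem &cmd,
                       const std::string &parseCode)
{
    OutputList::iterator next = cmd.element;
    ++next;  // std::list::insert inserts before, so step past the element
    cmd.element = output.insert(next, parseCode);
}
```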

Monday, February 28, 2011

INPUT Translation – Beginning

Now that execution of the INPUT command, and therefore the form of the translation, has been defined, the actual process of the translation can be defined. The translation begins with one of the InputBegin codes, which will be triggered by a comma or semicolon token. The INPUT or INPUT PROMPT token will already be on top of the command stack.

For the INPUT command, when the first comma, a semicolon or an end-of-line token is received, an InputBegin token will be appended to the output. The token received can be converted to the InputBegin token (more efficient to change the token than to delete the token not needed and then create a new one).

For the INPUT PROMPT command, when a comma or semicolon token is received, there must be a string on top of the done stack (the prompt string expression). Depending on whether this string is temporary or not will determine whether an InputBeginStr or InputBeginTmp will be appended to the output. For InputBeginStr, the string will be attached since the translator will not know if it is a variable, array or a user function. The token received can be converted to this token. An end-of-line token at this point would produce an “expected comma or semicolon” error.

Sunday, February 27, 2011

Automatic Code Enumeration Generation

To have the Code enumeration generated automatically from the table entries, the table source was structured so the awk scripts can read it. Since the Code enumeration value will now be the same as the table entry index, the code member of the table entry is not necessary and was removed. The code name initializers in the table entries were moved to a comment on the line of each entry's open brace.

The awk scripts were rewritten to read the table source file instead of the main include file, and were also changed to read the table source file directly and write the output files directly. This eliminates the requirement to redirect the input and output to the correct files when running the awk scripts. Logic was also added to the codes awk script to check for duplicate code names.

The code to index conversion array that was initialized in the table class constructor, along with the check for duplicate and missing codes, was removed. The code and index access functions in the table class were also removed. The token class index member was replaced with a code member. All the code was updated to use the code enumeration value instead of the index, though the code will be used as an index.

One problem with using an enumeration value instead of an integer index is that normal math functions cannot be used, like the add and increment operators. These operators are needed, so operator functions were created for the code enumeration, which includes the add, prefix increment and postfix increment operators. These functions type cast to integer to add and then type cast back to the code enumeration value.
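The operator functions described above amount to casting to integer for the arithmetic and casting back to the enumeration; a sketch (with a hypothetical, abbreviated Code enumeration):

```cpp
#include <cassert>

enum Code { Null_Code, Add_Code, Sub_Code, sizeof_Code };

// add an integer offset to a code by casting through int
Code operator+(Code code, int number)
{
    return static_cast<Code>(static_cast<int>(code) + number);
}

// prefix increment: advance the code and return it
Code &operator++(Code &code)
{
    return code = code + 1;
}

// postfix increment: return the original value, then advance
Code operator++(Code &code, int)
{
    Code current = code;
    ++code;
    return current;
}
```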

The null code entry at the end of the table was moved to the beginning so that the null code enumeration value (index) would be zero. The table search function for searching for an immediate command previously assumed that the immediate commands were at the beginning of the table entry array. Moving the null code to the beginning of the array complicated this. Therefore, immediate command bracketing codes were put around these entries for this search function.

Due to these table entry changes, the parser test output files were updated since all the code indexes changed. This would be a good time to make another pre-release, but since there has been no download activity for the recent pre-releases, there will not be one at this time. Now the translation of the INPUT command can begin...

Saturday, February 26, 2011

Code Enumeration vs. Table Entry Indexes

While designing the error handling mechanism for the INPUT command, a thought occurred related to program codes. For efficient program execution at run-time, the index of the table entry will be stored in the internal program code, not the code enumeration value.

If the code enumeration value was used, it would first need to be converted to the table entry index (needed to get the run-time handler function pointer) by going through the code to index array set up during table initialization. The intention all along has been to use table entry indexes in the internal program code.

For the INPUT command's error recovery, when it is backing up execution and checking for input parse codes, it would need to convert the table entry indexes to codes before it could check whether each is an input parse code. This would not be efficient during program execution. For the INPUT command itself, execution time is not critical since it is about to stop and wait for user input, so a few extra program cycles won't matter much, but this problem could occur for other more time-critical commands.

Therefore, it is desirable for the code enumeration values to be the same as the table entry indexes. One simple solution is to make sure the code enumeration values match the table entries. Unfortunately this relies on the programmer to keep the two in sync, is very error prone, and is just a general pain to begin with.

A better way is to generate the code enumeration automatically from the table entries using an awk script. This method would be similar to how the test_codes.h file (used by the test_ibcp.cpp source file) is generated automatically by scanning for codes in the ibcp.h file.

INPUT Execution – Error Handling

Errors can occur while parsing the input – in one of the input parse codes. When an error occurs, after it is reported, the already parsed values in the temporary input values stack need to be thrown away and execution needs to resume at the beginning of the INPUT statement. However, instead of issuing the prompt again, the cursor will be positioned at the beginning of the input, which will still contain the erroneous input, allowing the user to correct it instead of reentering it (versus the traditional “Redo from start”).

The temporary input values stack can't simply be reset (setting the internal index to -1, the empty stack indicator) because elements may contain allocated string values, which need to be deleted to prevent memory leaks. Like the evaluation stack, the elements of the temporary input values stack won't have any indicator of what data type they are. Consider the basic format of the INPUT statement (only up to the parsing codes is shown):
InputBegin InputParseType1 InputParseType2 InputParseType3'End' ...
Say an error occurs on the second parse code. The program execution pointer will be pointing at the next code, InputParseType3 (the pointer is incremented after reading each program code word, then the code read is executed by calling its run-time handler). Execution needs to be backed up, reading the program codes in reverse, until the InputBegin is reached.

As each code is passed, one element will be popped from the temporary input values stack. If the code passed is an InputParseStr code, then the element popped is a string that needs to be deleted. The beginning has been found when a non input parse code is encountered (it could be InputBegin, InputBeginStr or InputBeginTmp). The stack will then be empty.

The INPUT begin code will then be executed again, and it will call the get input routine. The get input routine normally allocates the temporary input values stack. For error recovery, it will see that the stack is already allocated, so instead of outputting the prompt and saving the cursor position at the beginning of the input, it will restore the saved cursor position and get the input starting with the previously entered erroneous input. Execution will then resume with the corrected input.

Friday, February 25, 2011

INPUT Execution – Temporary Values

The values parsed from the entered input to be assigned to the input variables need to be stored somewhere other than the evaluation stack. The logical place is another temporary input values stack. This stack will be allocated and initialized in the get input routine and will be removed by the final INPUT command code.

After a value is parsed by one of the input parse codes, it will be saved (pushed) to the temporary input values stack. At the last input parse code (with the 'End' sub-code), an index used to access this stack will be set to zero. As each value is assigned by an input assign code, this index will be incremented. In other words, this stack is being used as a First-In-First-Out list instead of a standard Last-In-First-Out stack.

The SimpleStack class (to be renamed to just Stack since there is no other class named Stack) does not currently have this mechanism. It will be added when the run-time code is implemented, but it is not needed for translating the INPUT command, so this addition will wait.

Thursday, February 24, 2011

INPUT Execution Codes – Ending

There will be the final INPUT command code at the end of the input statement. Besides cleaning up, the only action that needs to be performed is to advance to the next line if the 'Keep' sub-code is not set (this sub-code is set when there is a semicolon at the end of the INPUT statement). In summary, the “INPUT I%,A(I%)” statement will be translated as:
InputBegin InputParseInt InputParseDbl'End' I%<ref> InputAssignInt I% A(<ref> InputAssignDbl Input
As usual for RPN format, the command is at the end of the translation. What remains to be designed is where the temporary input values will be stored and how errors will be handled. An error can occur in the parsing codes. When an error occurs, execution must go back to the InputBegin; however, the prompt does not need to be output again – after the error is reported, the cursor only needs to be positioned back to where it was after the prompt was output.

Wednesday, February 23, 2011

INPUT Execution Codes - Assigning

The assigning of the input values must be done separately from the parsing of the values entered due to the two input rules. For the example, these are steps 7 through 11. Some of these steps are standard expression codes: push a reference to an integer variable (step 7), push the value of an integer variable (step 9), and calculate a reference to an array element by popping an integer subscript value and pushing the reference to the element (step 10).

There will be a code to assign an input value to an input variable for each data type: InputAssignInt, InputAssignDbl and InputAssignStr. For InputAssignStr, the InputParseStr code will have created a string from the input value. This string will be assigned to the string variable, replacing the previous string value, which will be deleted. Therefore, there will be no need to deal with temporary strings.

The values being assigned will need to be stored temporarily somewhere other than the evaluation stack.

Tuesday, February 22, 2011

INPUT Execution Codes - Parsing

The parsing of the entered input values must be done separately from the assignment of the values to the input variables. This is a departure from the design previously described and is due to the two input rules. For the example, these are steps 3 through 6. Notice that after a value is parsed (steps 3 and 5), a check is made for the next character (steps 4 and 6). This character must be a comma after each value except the last, where an end-of-line (no character) is expected.

There will be a code to parse an input value for each data type: InputParseInt, InputParseDbl and InputParseStr. An 'End' sub-code will be set on the last parse code. If this sub-code is not set, then the next character must be a comma; otherwise an end-of-line (no character) is expected.

The values that are parsed need to be put somewhere. If input values were pushed on to the evaluation stack, then when the references to the input variables are pushed, the input values would be down the stack and would not be easily accessible. Therefore, the evaluation stack can't be used.

Monday, February 21, 2011

INPUT Execution Codes - Prompting

Execution and translation of the INPUT command was previously described mostly in the posts on June 24, 2010, June 25, 2010 and June 27, 2010. This now needs to be revised since that execution broke the two rules listed at the end of the February 16, 2011 post. Steps 1 and 2 handle issuing the prompt and getting input from the user. This is almost the same as previously defined, which are these codes:
InputBegin    – output default prompt and get input
InputBeginStr – output string prompt and get input
InputBeginTmp – output string prompt, get input and delete temporary string
During execution, the run-time handlers for each of these codes will call a common routine for getting input. This common get input routine can also handle outputting the prompt and can simply use the string on top of the evaluation stack. There will be an argument for whether to output the prompt string from the top of the stack; InputBegin will set this argument to false.

There will be another argument for whether to output the default prompt. InputBegin will set this to true, and the other two will set it to true if the 'Question' sub-code is set (which occurs when the prompt string expression was followed by a comma instead of a semicolon).

For InputBeginTmp, the temporary string can be deleted upon returning from the get input routine since it will no longer be needed with the improvement in execution described in last Saturday's posts.

Saturday, February 19, 2011

INPUT Command – Execution Procedure

To determine how the INPUT command will be encoded into internal memory, the internal codes need to be arranged according to how the INPUT command will be executed including handling errors. Using the same “INPUT I%,A(I%)” example statement from before, here is the procedure (for the moment without taking into account error handling):
  1. Issue prompt (for this statement, the default “? ” prompt)
  2. Get the input from the user (allowing for editing like backspace, cursor left/right, insert/overwrite, delete, and terminated by enter)
  3. Parse an integer value from the input for I% and save it (I% can't actually be assigned yet)
  4. Check if the next character in the input is a comma
  5. Parse a double value from the input for A(I%) and save it (A(I%) can't be assigned yet since I% hasn't been assigned)
  6. Check if there are no more characters
  7. Push reference of I% to evaluation stack
  8. Assign saved integer value to reference on top of stack – I%
  9. Push value of I% to evaluation stack (I% has now been assigned)
  10. Calculate reference for array A by popping index from evaluation stack, push calculated reference, A(I%), to stack
  11. Assign saved double value to reference on top of stack – A(I%)
  12. End of INPUT command, advance to next line of output
Codes need to be defined for these steps. Some of these steps are already standard token types: step 7 is I%<ref>, step 9 is I%, and step 10 is A(<ref>. The rest of the steps will be INPUT command related codes.

INPUT Command – Execution Improvement

An improvement can be made to the execution of the INPUT command that will simplify the execution and work better from the user's standpoint. The traditional implementation of the INPUT command was designed to work with Teletypes, which has no place on modern computers. This has to do with what happens when the input entered by the user is invalid.

After surveying several BASIC implementations, all do a form of “Redo from start” on a new line and then reissue the prompt on another new line, forcing the user to start their input from the beginning. Many of the implementations don't even check for the presence of all the values requested, or accept string values, and then simply set the non-entered or invalid values to zero. This is very sloppy and forces the programmer to do more work validating input.

A better alternative is to properly parse and validate the input entered. If something isn't valid, then output a temporary error message, point to where the error is (so the user doesn't have to guess what was wrong) and then allow the user to edit their input. Using extra output lines is unnecessary; the error message will be removed from the screen when INPUT is done. This also simplifies run-time in that the prompt string does not need to be saved until the end of the INPUT statement (deleting it if it is a temporary string) since it now only needs to be output once.

There are other enhancements that can be made to the INPUT command, like having a fixed length input field with an optional template and accepting special exit keys (function keys, escape, page up/down, arrow up/down, etc.) that the programmer can check for, but these will come later once the project is up and running. For now, the translation of the current INPUT command needs to be implemented. As always, before the translation can be implemented, some idea of how the INPUT command will work at run-time is needed to determine what tokens need to be put into the RPN output list.

Wednesday, February 16, 2011

Translator – INPUT Command – New Problem

A few new ideas for the INPUT command have been developed that may simplify its execution and work better than the traditional implementation (more on this in a bit). Upon reviewing the previous posts on the INPUT command and reading the ANSI Standard document for the INPUT command again, a problem was discovered with the previous design. The problem can be shown with this INPUT statement:
INPUT I%,A(I%)
This is probably bad programming - if the value entered is outside the range of the array, an exception occurs. In any case, it is allowed. Both GW-Basic and QBASIC act as expected: the first value entered becomes the index of the element that is assigned the second value. However, the design laid out before won't act as expected. Consider the planned translation for this statement:
InputGet I%<ref> InputInt I% A(<ref> InputDbl Input
This will not work. Consider the items pushed to the evaluation stack upon reaching the Input code at the end:
  1. Input information (prompt is specified, question flag, and location)
  2. I%<ref>
  3. Parsed integer value from input
  4. Pointer to the integer assign routine
  5. A(I%)<ref> (after I% pushed, popped and reference to A(I%) calculated)
  6. Parsed double value from input
  7. Pointer to the double assign routine
Notice in item 5 that the value of I% used to calculate the reference to A(I%) would not be the value just parsed from the input, because I% has not been assigned yet – that would occur only after all of the input has been parsed and found to be valid. Making the implementation difficult are two rules that must be followed:
  1. Can't put references to the evaluation stack until previous values have been assigned.
  2. Can't assign any variables to the input values until the entire input has been parsed and validated.

Monday, February 14, 2011

Translator – Restructure Pre-Release

Since the restructuring of the code was significant, it is probably a good time to make a pre-release of the code before commencing with the INPUT command. The file ibcp_0.1.15-pre-1-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. When uploading these files, it was discovered that the ibcp_0.1.14-src.zip was missing the complete set of test files, and so was uploaded again. The shell script used to automatically generate a release was using the wrong list to generate the source zip file. Now on to the INPUT command (finally)...

Sunday, February 13, 2011

Translator – More Code Restructuring

It turns out there were really no more issues, just some minor bugs that needed to be corrected. However, in looking at the error messages, the “expected statement” just didn't seem to be as clear as it could be and so was changed to “expected command” along with the name of the corresponding token status.

The size of the add token routine has been getting out of hand for some time. It's not really a good idea to let a function get so big because it becomes much harder to understand and maintain. So it is time to break it up into smaller functions. Generally, no variables need to be passed between these functions except for a reference to the token pointer and token status (usually the return value). The add token function was broken up into these functions:
process operand – handles operands when in operand status and token is not an operator
end expression error – gets the error when an expression is ended prematurely (an end expression token is received in operand state) and is only called when the state is not first operand
process unary operator – handles the checking if an operator token is a unary operator received in operand state including processing open parentheses tokens
process binary operator – handles tokens when in binary operator state
process operator – empties higher precedence operators from the hold stack, adding them to the output list (functionality merged from the add operator function, which was removed since it only contained a few lines), then calls the token handler for the operator token
Now the code will be a bit more manageable as the INPUT and other commands are implemented.

Saturday, February 12, 2011

Translator – PRINT and Assign Restructuring

Only the PRINT and Assign commands are implemented so far. These were restructured so that the related code for these commands contained in the comma and semicolon token handlers was relocated into the command handlers. These token handlers will now call the command handlers just like the end-of-line token handler does. A switch on the code of the token passed in was added to the command handlers. Previously this token was always the end-of-line token.

The code from the end-of-line token handler that called a command handler was moved to a new call command handler routine that works as described in the last post, except that a check was added at the beginning for an empty command stack. This condition can occur in an assignment statement before a comma or equal token has been received. A null token status will be returned for the caller to report the appropriate error.

For the comma token handler, a null token status can only occur in expression-only test mode because this mode does not put a command on the command stack. For the semicolon token handler, a null token status can occur if the semicolon is at the beginning of the line (where a command is expected) or after a variable (where a comma or equal is expected). The same conditions don't occur for a comma because there are specific checks depending on the current mode (command, assignment list or expression), whereas the semicolon handler doesn't check the mode.

For the PRINT command handler, the related code was moved from these token handlers into the appropriate cases of the switch. Since no other tokens are currently expected, the default case was set to return an unexpected token bug error.

The Assign command handler only expects an end-of-line token, so for the default case, assignment list mode returns an equal or comma expected error, and expression mode returns an operator or end of statement expected error when expecting a binary operator, or a numeric or string expression expected error when expecting an operand. Before returning the error, the command item token needed to be set to the token passed in so the error points to the token causing it.

A few other issues were discovered after the code was restructured and tested...

Wednesday, February 9, 2011

Translator – Token/Command Handler Restructuring

While looking over the design notes for the INPUT and INPUT PROMPT commands, it became apparent that besides implementing a new INPUT command handler, parts would also be implemented in the Comma and Semicolon token handlers, just like the PRINT command. As further commands are implemented, these token handlers would increase in size, and the code to handle a command would be spread over several different routines. A better design would be to process all of a command in a single routine – the command handler.

The EOL token handler already calls the command handlers, each of which currently assumes it is called for an EOL token. The code that would be in the comma and semicolon token handlers could be moved to the command handler, which would have a switch on the code of the token passed in to perform the appropriate action. Then the comma and semicolon token handlers would also call the command handler. If the command doesn't support the passed token, it would return an error.

There is a sequence of code to call a command handler: get the command item from the top of the command stack, get the command token's command handler from the table, return an error if there is no handler in the table, call the command handler, and if it returns an error, check whether the command handler changed the token (for the error), delete the original token passed in if it did, and set the error token to return. This sequence will be put into a new call command handler routine so the code is not repeated in each token handler.