Friday, August 30, 2013

The Encoding Procedure

There are several steps for encoding a translated RPN token list for a program line into an internal program code line.

Step 1: Each token type that does not have a code needs to be assigned a code.  These token types include identifiers with and without parentheses, constants, and define functions with and without parentheses.  For identifier tokens, what the identifier refers to (variable, array or user function) needs to be determined before a code can be assigned.

Initially only variables will be supported, so identifiers without parentheses will be assumed to be variables.  Except for constants, the other token types will report a "not yet implemented" error.  For a variable, there will be six program codes: Double Variable, Integer Variable, String Variable, Double Reference Variable, Integer Reference Variable and String Reference Variable.  The specific code is selected based on the data type and reference flag of the token.
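As a rough illustration of selecting one of the six variable codes, here is a minimal Python sketch.  The names DataType, VARIABLE_CODES and variable_code are hypothetical, and the code names are written as strings purely for readability; the actual code values are not defined in this post.

```python
from enum import Enum, auto


class DataType(Enum):
    """Hypothetical data types a variable token can carry."""
    DOUBLE = auto()
    INTEGER = auto()
    STRING = auto()


# Hypothetical table: (data type, reference flag) -> program code name.
VARIABLE_CODES = {
    (DataType.DOUBLE, False): "DoubleVariable",
    (DataType.INTEGER, False): "IntegerVariable",
    (DataType.STRING, False): "StringVariable",
    (DataType.DOUBLE, True): "DoubleReferenceVariable",
    (DataType.INTEGER, True): "IntegerReferenceVariable",
    (DataType.STRING, True): "StringReferenceVariable",
}


def variable_code(data_type, reference):
    """Select the variable code from the token's data type and reference flag."""
    return VARIABLE_CODES[(data_type, reference)]
```

A table lookup keeps the selection to a single step, so adding another data type later would only mean adding entries.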

Later, when support for arrays and functions is implemented, the dictionaries will be used to determine the type of the identifier.  For arrays, the attached arguments are integer subscripts, so each needs to be checked for an integer value.  For double type subscripts, a hidden integer conversion code will be inserted.  An error will be reported for string subscripts.  The number of subscripts will also be validated.  Similarly for function arguments, the data type of each argument will be checked, with numeric conversion codes added or errors reported as needed.
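Since array support is not designed yet, the following is only a guess at how the subscript checks could look.  The function name check_subscripts, the string data type names and the "ConvertToInteger" code name are all assumptions made for illustration.

```python
def check_subscripts(subscript_types, expected_count):
    """Validate array subscripts (hypothetical sketch).

    Returns one entry per subscript: None when no conversion is needed,
    or the name of a hidden conversion code to insert after the subscript.
    """
    if len(subscript_types) != expected_count:
        raise ValueError(
            f"expected {expected_count} subscripts, got {len(subscript_types)}")

    conversions = []
    for position, data_type in enumerate(subscript_types):
        if data_type == "integer":
            conversions.append(None)                # already an integer
        elif data_type == "double":
            conversions.append("ConvertToInteger")  # hidden conversion code
        else:
            raise TypeError(
                f"subscript {position + 1} must be numeric, got {data_type}")
    return conversions
```

Function arguments could presumably be handled the same way, with the expected data type of each argument taken from the function's dictionary entry.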

Step 2: The instruction size of each token will be determined.  Instructions are either one or two program words.  The token type can be used for this determination.  The operator and internal function (with and without parentheses) token types are one word, the command token type can be either, and the others are two words.  For commands, a new table entry flag is needed to indicate the size of each command.
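The size rule can be sketched as a small function.  The token type names and the command_has_operand parameter (standing in for the new table entry flag) are hypothetical.

```python
# Hypothetical names for the token types that always encode as one word.
ONE_WORD_TYPES = {"Operator", "InternalFunction", "InternalFunctionParen"}


def instruction_size(token_type, command_has_operand=False):
    """Return the encoded size of a token in program words (1 or 2)."""
    if token_type in ONE_WORD_TYPES:
        return 1
    if token_type == "Command":
        # Commands can be either size, so a per-command table flag decides.
        return 2 if command_has_operand else 1
    # Constants, identifiers and define functions carry an operand word.
    return 2
```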

The translated token list will be scanned while maintaining the total encoded size of the line.  For each token, an index (a new member for a token) will be set to the current total size.  The total size is then incremented by the encoded size of the token.  This index will be used later for calculating the offset for single structure statements (like a single line IF statement).
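The scan amounts to a running sum.  In this sketch the function name assign_indexes is hypothetical, and the token sizes are passed in as a plain list rather than read from token objects.

```python
def assign_indexes(sizes):
    """Give each token the word index where its instruction will start.

    sizes holds the encoded size (1 or 2 words) of each token in RPN order.
    Returns the list of starting indexes and the total size of the line.
    """
    total = 0
    indexes = []
    for size in sizes:
        indexes.append(total)  # token starts at the current total size
        total += size          # then the total grows by the token's size
    return indexes, total


# Example: three two-word tokens followed by two one-word tokens.
indexes, total = assign_indexes([2, 2, 2, 1, 1])
print(indexes, total)  # [0, 2, 4, 6, 7] 8
```

With the starting index of every token known, an offset between two tokens is simply the difference of their indexes.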

Step 3: The encoded line is generated (its total size is now known).  For each token, the first instruction word is created from the code and sub-code of the token.  For two word instructions, the second operand word is determined.  For index values, the identifier is looked up in a dictionary and the second operand word is set to the index of the dictionary entry.  For offset values, the offset is calculated from the attached token.  Block numbers will probably work similarly to index values with an associated block dictionary (this mechanism is not defined yet).
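A minimal sketch of this step for index operands only follows.  Several details are assumptions: tokens are plain dicts, the sub-code is assumed to be flag bits that do not overlap the code so the two can be combined with a bitwise OR, and the Dictionary class is a stand-in for the real dictionaries.  Offsets and block numbers are left out.

```python
class Dictionary:
    """Hypothetical dictionary mapping each distinct name to a stable index."""

    def __init__(self):
        self._index = {}
        self._names = []

    def add(self, name):
        """Return the index of name, adding a new entry if it is not present."""
        if name not in self._index:
            self._index[name] = len(self._names)
            self._names.append(name)
        return self._index[name]


def encode_line(tokens, dictionaries):
    """Encode tokens into a flat list of program words."""
    words = []
    for token in tokens:
        # First word: code combined with sub-code (assumed non-overlapping bits).
        words.append(token["code"] | token.get("subcode", 0))
        # Second word: index of the token's entry in its dictionary, if any.
        dictionary_name = token.get("dictionary")
        if dictionary_name is not None:
            words.append(dictionaries[dictionary_name].add(token["text"]))
    return words
```

Tokens without a "dictionary" key produce one word, and the others produce two, which matches the sizes determined in step 2.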

Once the line has been encoded, it can be inserted into the program.  At this point things get complicated.  Some dictionary entries will refer to specific locations in the code (consider the IF and END IF example from the previous post).  For any line inserted, replaced or deleted, all these references to program locations need to be adjusted if located after the point of change.  However, this is not a worry initially since only dictionaries for variables, constants and remarks are needed and these will not contain program locations.
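The eventual adjustment might look something like the sketch below.  The function name adjust_locations and the flat list of word offsets are assumptions, and references that point into a deleted range are not handled here.

```python
def adjust_locations(locations, change_at, delta):
    """Shift stored program locations after a change to the program.

    locations: word offsets held by dictionary entries (hypothetical).
    change_at: word offset where code was inserted or removed.
    delta: words added (positive) or removed (negative).
    Locations before the point of change are left alone.
    """
    return [
        location + delta if location >= change_at else location
        for location in locations
    ]


# Example: inserting a four-word line at offset 8.
print(adjust_locations([3, 8, 12], 8, 4))  # [3, 12, 16]
```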