Most of our content files are kept in RTF format. Even the tags are embedded within an RTF stream. To effectively deal with these content files, our compiler has to be able to:
Of the types of information Windows Word stores above the text in an RTF file, we are concerned only with the RTF definition, the font table, and the color table. Upon opening an input file, the compiler reads information until it comes to the end of the RTF header and then passes the header to this routine:
Function bProcessFrontMatter (sRtfBuff As String) As Integer
    Dim nStartFont As Integer
    Dim nStartColor As Integer
    Dim nStartStyle As Integer
    Rem get define and char set
    nStartFont = InStr(sRtfBuff, "{\fonttbl")
    If nStartFont = 0 Then
        bProcessFrontMatter = False
        MsgBox "Font table not found in RTF file"
        Exit Function
    End If
    msRtfStarter = Left$(sRtfBuff, nStartFont - 1)
    Rem get font table
    nStartColor = InStr(sRtfBuff, "{\colortbl")
    If nStartColor = 0 Then
        bProcessFrontMatter = False
        MsgBox "Color table not found in RTF file"
        Exit Function
    End If
    msRtfFontTable = Mid$(sRtfBuff, nStartFont, nStartColor - nStartFont)
    'kill alt fonts which our control can't handle
    msRtfFontTable = sReplaceString(msRtfFontTable, "{\*\falt^?}", "")
    Rem get color table
    nStartStyle = InStr(sRtfBuff, "{\stylesheet")
    If nStartStyle = 0 Then
        bProcessFrontMatter = False
        MsgBox "Stylesheet not found in RTF file"
        Exit Function
    End If
    msRtfColorTable = Mid$(sRtfBuff, nStartColor, nStartStyle - nStartColor)
    bProcessFrontMatter = True
End Function
The routine does simple string manipulations to segment the RTF header into usable chunks. Later, as text content is added to memo fields in the Screen and Item tables, this header information is prepended to form a complete RTF stream in each field. Note that this routine assumes that there is no file table stored in the RTF file.
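As a rough illustration of the prepending step, the sketch below joins the three saved header pieces with a run of body text. The function name sBuildRtfStream is hypothetical, and the sketch assumes that the body text does not already carry the brace that closes the outermost RTF group; the real compiler may assemble its fields differently.

```vb
' Module-level pieces filled in by bProcessFrontMatter
Dim msRtfStarter As String
Dim msRtfFontTable As String
Dim msRtfColorTable As String

' Hypothetical helper: wraps a run of body text in the saved header
Function sBuildRtfStream (sBody As String) As String
    Dim sHeader As String
    sHeader = msRtfStarter & msRtfFontTable & msRtfColorTable
    ' Assumption: sBody does not include the brace that closes the
    ' outermost {\rtf group, so it is added here
    sBuildRtfStream = sHeader & sBody & "}"
End Function
```

Because the stylesheet is not saved, a stream built this way carries only the RTF definition, the font table, and the color table ahead of the text.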
Once the header information has been processed, the compiler reads the input file one paragraph at a time. A paragraph is the right amount of data to read (as opposed to a line or a fixed number of bytes) because that is how data is marked in the RTF stream. In RTF, paragraphs are separated by a "\par" token.
The following routine reads characters from the input file until the paragraph token is found, the number of characters in the paragraph exceeds 20,000, or the end of the file is reached.
Function sGetALine (hInFile As Integer) As String
    Dim sBuff As String
    Dim nchars As Integer
    Do While Right$(sBuff, 5) <> "\par "
        sBuff = sBuff & Input(1, #hInFile)
        If nchars > 20000 Then
            Exit Do
        End If
        nchars = nchars + 1
        If EOF(hInFile) Then Exit Do
    Loop
    sGetALine = sBuff
End Function
In most cases, the paragraph does not exceed 20,000 characters and the routine returns a full paragraph. However, our system allows embedded OLE objects (mostly images), which can drive the size of a paragraph past the maximum size of a Visual Basic string variable. To handle these cases, we exit the routine without a full paragraph and use the AppendChunk method to temporarily cache data while it is moved from the source file to the database.
Paragraphs are buffered in a string variable until the size of the variable exceeds 20,000. At that point, the buffer is flushed to temporary storage in the database.
If Len(sRtfBuffer) + Len(sCurrLine) > 20000 Then
    FlushToTmp(sRtfBuffer)
    sRtfBuffer = ""
End If
sRtfBuffer = sRtfBuffer + sCurrLine & gsCR
The advertised maximum size of a string variable is 65K. From experience we have found that 20,000 characters is about the maximum we can buffer in this manner without causing a string error in Visual Basic.
The FlushToTmp routine simply appends the information it is passed to a pre-defined temporary holding place in the database.
Sub FlushToTmp (sText As String)
    mtbTmp("Text").AppendChunk sText
End Sub
Later, when the paragraph buffer is ready to be saved to its correct place in the database, the application checks to see if there is anything in the temporary location. If there is, it is transferred from the temporary location to the permanent location in the database.
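The transfer itself is not shown in the article. A minimal sketch of one way to do it, using the GetChunk and FieldSize methods that pair with AppendChunk, might look like the following. The routine name MoveTmpToField and the chunk size are assumptions, and the sketch presumes that mtbTmp is positioned on the record holding the cached text and that the destination table is already in Edit or AddNew mode.

```vb
' Hypothetical helper: copies the cached text out of the temporary
' field and appends it to a destination field, one chunk at a time.
Sub MoveTmpToField (fldDest As Field)
    Const CHUNK_SIZE = 16384        ' assumed size, kept well under 20,000
    Dim lTotal As Long              ' bytes held in the temporary field
    Dim lOffset As Long             ' bytes copied so far
    Dim lPiece As Long              ' bytes to copy on this pass

    lTotal = mtbTmp("Text").FieldSize()
    lOffset = 0
    Do While lOffset < lTotal
        lPiece = lTotal - lOffset
        If lPiece > CHUNK_SIZE Then lPiece = CHUNK_SIZE
        fldDest.AppendChunk mtbTmp("Text").GetChunk(lOffset, lPiece)
        lOffset = lOffset + lPiece
    Loop
End Sub
```

Copying in pieces keeps every intermediate string below the 20,000-character working limit described above, no matter how large the cached paragraph has grown.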
An RTF file is a series of embedded clauses. Each clause begins and ends with curly brackets ({}). For example, to make the words "My Computer" bold the RTF stream could bracket the words with a bold clause:
{\b My Computer}
If only the "o" is bolded the clause could look like this:
My C{\b o}mputer
If an author applies character formatting (such as bold) across a paragraph boundary, then the opening curly bracket will be in the first paragraph and the closing curly bracket will be in the second paragraph. Because our compiler breaks RTF files at paragraph boundaries, we need to make sure that no character formatting crosses these boundaries. Also, if an author embeds character formatting within one of the tags in the file, then the compiler may not recognize it as a valid tag.
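To make the boundary problem concrete, the sketch below tests whether the curly brackets in a single paragraph pair up. It is an illustration only, not part of the compiler described here; the function name bBracesBalanced is hypothetical. A backslash is treated as escaping the character that follows it, so literal \{ and \} in the text are not counted.

```vb
' Hypothetical check: returns True when every "{" in the paragraph
' is closed by a "}" within the same paragraph.
Function bBracesBalanced (sPara As String) As Integer
    Dim i As Long
    Dim nDepth As Integer
    Dim sCh As String

    bBracesBalanced = False
    i = 1
    Do While i <= Len(sPara)
        sCh = Mid$(sPara, i, 1)
        If sCh = "\" Then
            ' skip the character after a backslash (covers \{ \} and \\)
            i = i + 1
        ElseIf sCh = "{" Then
            nDepth = nDepth + 1
        ElseIf sCh = "}" Then
            nDepth = nDepth - 1
            ' a close with no matching open: the clause began earlier
            If nDepth < 0 Then Exit Function
        End If
        i = i + 1
    Loop
    If nDepth = 0 Then bBracesBalanced = True
End Function
```

With this sketch, "{\b My Computer}" passes, while a paragraph that contains only the opening "{\b My" fails because its clause closes in a later paragraph. The last paragraph of a file would also fail, since it carries the bracket that closes the whole document.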
The easiest way to ensure that tags contain no embedded formatting, and that character formatting does not cross the last paragraph boundary in a series of paragraphs, is to use WordBasic to remove character formatting from the crucial areas.
The following WordBasic routine finds the boundaries between screens in an input file and removes any character formatting.
Sub NixFormat
    StartOfDocument
    EditFind .Find = "..SCREEN:", .Direction = 0, .Format = 0, .Wrap = 0
    While EditFindFound()
        ParaUp
        CharLeft
        ParaDown 2, 1
        ResetChar
        ParaDown
        EditFind
    Wend
End Sub