Determining the Entropy of the Values We Added

This next procedure, determineEntropy, is the last one called from the cmdAnalyze button. What we want to do here is calculate something called (surprise!) entropy. The term comes from communication theory and is a precise measurement of information. Remember that we added the fields CountryLanguage, CountryRegion and Country to our ID3 table? What we now want to do is determine the relative impact of each of these categories on total sales growth. Entropy is a measure of the uncertainty of the classification. The smaller the number, the less 'uncertainty' that the specific classification is important to the result, sales growth. As the number increases, the more uncertainty that this classification really does influence the sales growth.

We want to see how the three categories: CountryLanguage, CountryRegion and Country compare in importance to sales growth. We will do that by calculating the entropy factor for each. The lesser of the three numbers indicates the factor that is the most important in sales growth.

Try It Out - Determining which Factors Impact Sales the Most!

1.  Please add another sub to the frmID3 form called determineEntropy. Add the following code to that procedure.

Private Sub determineEntropy()

Dim adoTemp As ADODB.Recordset
Dim sSql As String
Dim totalSamples As Integer
Dim entropyCountry As Single
Dim entropyCountryLanguage As Single
Dim entropyCountryRegion As Single
Dim Position(2, 1) As Variant 'holds the classifications

sbStatus.Panels.Item(1).Text = "Determining Entropy..."
sbStatus.Panels.Item(2).Text = "Working..."

adoConnection.Execute("DELETE * FROM ID3 WHERE Category = NULL")
'-----------------------------------------------------
'-- Determine how many records are in the ID3 Table --
'-----------------------------------------------------
Set adoTemp = New ADODB.Recordset
sSql = "SELECT count(*) as HowMany from ID3"
adoTemp.Open sSql, adoConnection
totalSamples = adoTemp!HowMany
adoTemp.Close

'----------------------------------------------------------
'-- Determine the relative Entropy on each of the fields --
'----------------------------------------------------------
entropycountryLanguage = getEntropy _
     ("CountryLanguage", totalSamples)
Position(0, 0) = entropycountryLanguage
Position(0, 1) = "CountryLanguage"
entropyCountryRegion = getEntropy("CountryRegion", totalSamples)
Position(1, 0) = entropyCountryRegion
Position(1, 1) = "CountryRegion"
entropyCountry = getEntropy("Country", totalSamples)
Position(2, 0) = entropyCountry
Position(2, 1) = "Country"

Call qsort(Position, LBound(Position), UBound(Position))

txtAnalysis = "ID3 Analysis of the Category " & lstCategory & _
    " and the Product " & lstProduct & vbCrLf
txtAnalysis = txtAnalysis & "The lesser the entropy, the more "
txtAnalysis = txtAnalysis & "important is this Attribute "
txtAnalysis = txtAnalysis & "to overall Sales" & vbCrLf
txtAnalysis = txtAnalysis & Position(0, 1) & " Entropy: " & _
    Position(0, 0) & vbCrLf
txtAnalysis = txtAnalysis & Position(1, 1) & " Entropy: " & _
    Position(1, 0) & vbCrLf
txtAnalysis = txtAnalysis & Position(2, 1) & " Entropy: " & _
    Position(2, 0) & vbCrLf
txtAnalysis = txtAnalysis & "Product Manager - Review sales to: "
txtAnalysis = txtAnalysis & Position(0, 1) & vbCrLf

Call gridTheResults

sbStatus.Panels.Item(1).Text = ""
sbStatus.Panels.Item(2).Text = "Ready."

End Sub

How It Works

We will essentially determine the entropy factor on each of our three classifications and present the results to the product manager. But we want to ensure we are working with clean and valid data. In our effort to scrub the data, we want to delete any records from our ID3 table that could have a NULL value in the Category field. Remember that we are working with legacy data here. We can never assume that it is perfect. We must be on guard to ensure the data we are working with is clean. Once we do that, we just determine how many records are left in our ID3 table. This number, totalSamples, is used to see how each of the categories we are looking at did compared with all of the samples.

adoConnection.Execute ("DELETE * FROM ID3 WHERE Category = NULL")
'-----------------------------------------------------
'-- Determine how many records are in the ID3 Table --
'-----------------------------------------------------
Set adoTemp = New ADODB.Recordset
sSql = "SELECT count(*) as HowMany from ID3"
adoTemp.Open sSql, adoConnection
totalSamples = adoTemp!HowMany
adoTemp.Close

Now we have a clean ID3 table. It only contains records where we have valid sales data for each country that actually ordered this product we are analyzing. Now that the table is clean, we are prepared to determine the entropy factor for the CountryLanguage, CountryRegion and Country. We have another routine, getEntropy, that will do the calculations for us.

entropyCountryLanguage = getEntropy _
     ("CountryLanguage", totalSamples)
Position(0, 0) = entropyCountryLanguage
Position(0, 1) = "CountryLanguage"
entropyCountryRegion = getEntropy("CountryRegion", totalSamples)
Position(1, 0) = entropyCountryRegion
Position(1, 1) = "CountryRegion"
entropyCountry = getEntropy("Country", totalSamples)
Position(2, 0) = entropyCountry
Position(2, 1) = "Country"

We call the getEntropy routine and pass it two parameters, the classification we want to get the entropy factor on and totalSamples - just how many records are in the recordset.

So in the first line, we call the getEntropy routine and pass in "CountryLanguage" and the totalSamples. We assign the value that is returned to us to entropyCountryLanguage. We then update the local variant array, Position. In the first row, first position (0, 0) put in the entropy value that was returned for the CountryLanguage. Then in the first row, second position (0, 1) we add the title, "CountryLanguage". We then do this for the CountryRegion and then Country. Now we don't know how the values should be ranked.

In order to sort our array, I have included a modified quick sort routine that will sort the contents of our array, Position, from the lowest entropy factor to the highest. We simply call this routine and pass in the array with its lower and upper bounds. This routine, which we will write in a minute, will place the lowest entropy value first. As mentioned, the lower the score, the better. This means that this particular classification has the least uncertainty.

Call qsort(Position, LBound(Position), UBound(Position))

Once the array has been sorted, we display the results in the txtAnalysis text box. We also instruct the product manager to look at the value in the lowest category - whatever that might be. This is the category that is the most relevant to sales growth for that particular product.

txtAnalysis = "ID3 Analysis of the Category " & lstCategory & _
    " and the Product " & lstProduct & vbCrLf
txtAnalysis = txtAnalysis & "The lesser the entropy, the more "
txtAnalysis = txtAnalysis & "important is this Attribute "
txtAnalysis = txtAnalysis & "to overall Sales" & vbCrLf
txtAnalysis = txtAnalysis & Position(0, 1) & " Entropy: " & _
    Position(0, 0) & vbCrLf
txtAnalysis = txtAnalysis & Position(1, 1) & " Entropy: " & _
    Position(1, 0) & vbCrLf
txtAnalysis = txtAnalysis & Position(2, 1) & " Entropy: " & _
    Position(2, 0) & vbCrLf
txtAnalysis = txtAnalysis & "Product Manager - Review sales to: "
txtAnalysis = txtAnalysis & Position(0, 1) & vbCrLf

Finally, we call our generalized routine, gridTheResults. This will display the details of the ID3 algorithm to the user so she can visually see the relationships between the classifications.

Call gridTheResults

sbStatus.Panels.Item(1).Text = ""
sbStatus.Panels.Item(2).Text = "Ready."

Let's add the gridTheResults subroutine now!

Try It Out - Displaying Results in the resultsGrid FlexGrid

1.  Please add the following new subroutine called gridTheResults to the frmID3 form.

Public Sub gridTheResults()

'------------------------------------------------
'-- Now let's update the grid with the regions --
'------------------------------------------------
Dim adoID3 As ADODB.Recordset
Dim sSql As String
Dim iRows As Integer
Dim iCols As Integer
Dim iRowLoop As Integer
Dim iColLoop As Integer

sSql = "SELECT UpByHowMuch, OldQuantity, NewQuantity,"
sSql = sSql & " CountryRegion, CountryLanguage, Country FROM"
sSql = sSql & " ID3 ORDER BY UpByHowMuch DESC,"
sSql = sSql & " CountryRegion, CountryLanguage"

Set adoID3 = New ADODB.Recordset
adoID3.CursorLocation = adUseClient

adoID3.Open sSql, adoConnection, , , adCmdText

adoID3.MoveFirst

iRows = adoID3.RecordCount
iCols = adoID3.Fields.Count

resultsGrid.Rows = iRows
resultsGrid.Cols = iCols
'--------------------------
'-- Set up the grid here --
'--------------------------
resultsGrid.Row = 0

For iColLoop = 0 To resultsGrid.Cols - 1
  With resultsGrid
     .Col = iColLoop
     .ColWidth(iColLoop) = 1400
     .ColAlignment(iColLoop) = 7
     Select Case iColLoop
       Case 0
         .Text = "Growth Factor"
         .MergeCol(iColLoop) = True
       Case 1
         .Text = "Previous Qty"
         .MergeCol(iColLoop) = True
       Case 2
         .Text = "Recent Qty"
         .MergeCol(iColLoop) = True
       Case 3
         .Text = "Country Region"
       Case 4
          .Text = "Country Language"
       Case 5
          .Text = "Country"
     End Select
  End With
Next

resultsGrid.MergeCells = flexMergeFree

For iRowLoop = 1 To iRows - 1
  For iColLoop = 0 To iCols - 1
     resultsGrid.Row = iRowLoop
     resultsGrid.Col = iColLoop
     resultsGrid.Text = adoID3.Fields(iColLoop)
  Next
  adoID3.MoveNext
Next

adoID3.Close

Set adoID3 = Nothing

End Sub

How It Works

This code is very similar to that in the gridID3 subroutine so I won't discuss this subroutine in much detail.

We build an SQL statement taking the UpByHowMuch, OldQuantity, NewQuantity, CountryRegion, CountryLanguage and Country fields from the ID3 table. We then ORDER BY UpByHowMuch (in DESC order), CountryRegion and CountryLanguage.

sSql = "SELECT UpByHowMuch, OldQuantity, NewQuantity,"
sSql = sSql & " CountryRegion, CountryLanguage, Country FROM"
sSql = sSql & " ID3 ORDER BY UpByHowMuch DESC,"
sSql = sSql & " CountryRegion, CountryLanguage"

As with the gridID3 code, we loop through all the rows and columns in our adoID3 recordset. We use a Select Case statement to set up the headings for the columns - renaming UpByHowMuch as Growth Factor, OldQuantity as Previous Qty and NewQuantity as Recent Qty. We allow the cells for UpByHowMuch, OldQuantity and NewQuantity to be merged by setting their MergeCol properties to True.

Select Case iColLoop
       Case 0
         .Text = "Growth Factor"
         .MergeCol(iColLoop) = True
       Case 1
         .Text = "Previous Qty"
         .MergeCol(iColLoop) = True
       Case 2
         .Text = "Recent Qty"
         .MergeCol(iColLoop) = True
       Case 3
         .Text = "Country Region"
       Case 4
          .Text = "Country Language"
       Case 5
          .Text = "Country"
     End Select

© 1998 by Wrox Press. All rights reserved.