C++ Q & A

August 1996

Paul DiLascia is a freelance software consultant specializing in training and software development in C++ and Windows. He is the author of Windows++: Writing Reusable Code in C++ (Addison-Wesley, 1992).

Q Could you please comment on a discussion that we've been having in our development department about passing character strings into and out of C++ class objects? One side argues that they should always be passed as LPCSTRs, another argues for using the CString class.

I've enclosed a much-simplified class definition (see Figure 1) to illustrate the different methods considered. In reality our classes have a number of CString data members and, rather than having separate functions to set or get a particular item, we use the one function with an ID number to indicate which string we want to access.

Ian Clegg
England

A At first I thought I knew the answer to this question off the top of my head (after all, it seems like such a simple, innocent question), but then I began to wonder. Many hours and countless brain cells later, I found myself sucked deeper and deeper into C++, MFC internals, and yucky assembly language mucky-muck. In the end, it turned out I was right, but only through brute force was I able to prove it. After such a grueling ordeal, I quickly realized that the only way I could reward myself, the only possible pleasure to be gained from having endured it, was to inflict the same punishment on my readers. In fact, since CStrings are so important and ubiquitous in MFC, a familiar little class that programmers use every day, I decided to make this month's column a sort of mini-treatise on CStrings, especially since they've changed considerably as of release 4.0. I know it sounds boring but I guarantee you'll be surprised to learn all the things that go on while you're not looking.

First off, if you're not using CStrings, shame on you! Come on guys, the year 2000 is almost upon us! No more character arrays, strcpy, strdup, and all that rot. CStrings are easy, lightweight, overwrite-proof, and provide useful functions for manipulating strings. There's even a Format function that works like printf! Not to mention that CStrings go from English to International (ASCII to Unicode) with the flip of a compiler switch. There's just no excuse for writing char[256] any more unless you're dealing with legacy code written in COBOL.

Now that I got that off my chest, let's do something about Ian's code. First, it's always better to instantiate CString objects directly as inline class members or on the stack instead of allocating them from the heap. A CString is very small; it contains only one member (m_pchData, a pointer to the actual character data) so a CString is only four bytes, the same as an int. It makes no sense to allocate CStrings individually unless you have some truly bizarre situation at hand. In general, you should think of CString as a primitive type like int or long or double. You wouldn't allocate space to store one int, would you? So the first thing you should do is make m_str an actual CString instead of a pointer to one.

 class MyClass {
      CString m_str;      // not a pointer
.
.
.
};

This entails no extra storage overhead. On the contrary, it uses half the space the previous design used and results in less memory fragmentation. It also simplifies your code greatly because you can get rid of all the new/delete stuff and checks for NULL. CString already contains a lot of code to do all that checking for you. Use it.

The second thing you should do is declare your Set/GetCString functions with const CString& (reference to const CString) instead of just plain CString. When you pass an object by value, C++ must make a copy of it on the stack, which requires a function call to the copy constructor CString::CString(const CString&). If you use a reference, C++ just pushes a pointer and there's no copy constructor call. When you use a reference, you need const to tell the compiler that your Set function doesn't modify its argument or, in the case of Get, that the CString returned may not be modified. In general, you can use const Foo& as a way to pass Foo objects more efficiently, as if they were values, provided you don't modify them. Figure 2 shows my modified version of MyClass with const CString& declarations and m_str converted to CString.
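To see the difference without MFC at hand, here is a minimal sketch that counts copy-constructor calls. CountedString, LengthByValue, LengthByConstRef, and MeasureCopies are hypothetical names invented for this illustration; they stand in for CString and the Set functions and are not part of MFC.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a string class; it counts copy-constructor calls.
struct CountedString {
    static int copies;
    std::string text;
    explicit CountedString(const char* s) : text(s) {}
    CountedString(const CountedString& other) : text(other.text) { ++copies; }
};
int CountedString::copies = 0;

// Pass by value: the compiler must copy-construct the parameter.
std::size_t LengthByValue(CountedString s) { return s.text.size(); }

// Pass by const reference: only the address of the caller's object is passed.
std::size_t LengthByConstRef(const CountedString& s) { return s.text.size(); }

struct CopyCounts { int byValue; int byConstRef; };

// Calls each function once with the same named object and records the copies.
CopyCounts MeasureCopies() {
    CountedString s("hello");
    CopyCounts result;
    CountedString::copies = 0;
    (void)LengthByValue(s);
    result.byValue = CountedString::copies;
    CountedString::copies = 0;
    (void)LengthByConstRef(s);
    result.byConstRef = CountedString::copies;
    return result;
}

// Usage: the by-value call costs one copy, the const-reference call none.
void DemoCopies() {
    CopyCounts c = MeasureCopies();
    assert(c.byValue == 1);
    assert(c.byConstRef == 0);
}
```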

Now, let's explore the original question: should you declare Get/Set functions with LPCSTR or const CString&? If you take my initial advice to always use CString and never use LPCSTR, then this question never arises. However, LPCSTR is sometimes necessary, and the magic of C++ lets you use LPCSTR interchangeably with CString.

Say you have a function, like SetLPCSTR, that expects LPCSTR but you call it with a CString instead.

 MyClass myobj;
CString cs;
myobj.SetLPCSTR(cs);      // type mismatch?

Superficially it looks like a type mismatch, but this code compiles because CString has a conversion operator, CString::operator LPCSTR, that converts the CString to an LPCSTR. All the compiler needs to know is that there's this member function called operator LPCSTR (operator const char*) that returns LPCSTR. The compiler generates code like this:

 .
.
.
CString cs;
myobj.SetLPCSTR(cs.operator LPCSTR());   // OK, types agree

This looks funny because there's a space in the function name, but that's just syntax. Internally, operator LPCSTR is just another member function that returns LPCSTR. SetLPCSTR gets LPCSTR, which is what it expects.
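The same mechanism can be tried in isolation. The sketch below uses a hypothetical TinyString class (not MFC's CString) with a conversion operator to const char*, and shows that the implicit call and the spelled-out call to the operator do the same thing. LengthOf, ImplicitCall, and ExplicitCall are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical minimal wrapper; it does not own the characters it points to.
class TinyString {
    const char* m_pch;
public:
    explicit TinyString(const char* s) : m_pch(s) { assert(s != nullptr); }

    // Conversion operator: lets a TinyString be used where const char* is expected.
    operator const char*() const { return m_pch; }
};

// A function that, like SetLPCSTR, expects a plain character pointer.
std::size_t LengthOf(const char* psz) { return std::strlen(psz); }

// The compiler inserts the conversion for us...
std::size_t ImplicitCall(const TinyString& ts) { return LengthOf(ts); }

// ...which is equivalent to calling the operator by its full name.
std::size_t ExplicitCall(const TinyString& ts) {
    return LengthOf(ts.operator const char*());
}
```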

What about going the other way? What if you have a Set function that expects const CString& and you try to give it an LPCSTR?

 LPCSTR lp;
myobj.SetCString(lp);      // type mismatch?

This is a little more tricky. One of the functions defined for CString is CString::CString(LPCSTR), a constructor that creates a CString from an LPCSTR. The compiler notices this and says, "Duh, I can make this compile if I create a temporary variable."

 LPCSTR lp;
CString temp(lp);         // create temp
myobj.SetCString(temp);   // OK, args match

Once again, the types match: SetCString gets a CString, which is what it expects.

There are two other things I must point out here. First, hidden behind the scenes is a call to the destructor CString::~CString as temp goes out of scope. Second, the temp solution only works if the argument to SetCString is declared either CString or const CString&. If SetCString is declared to take CString& (a non-const reference), the compiler can't use the temp trick. For all it knows, SetCString might modify temp, and there's no way to propagate the change back to lp.
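Here is a small sketch of the temp trick, assuming a hypothetical Traced class that counts constructions and destructions (none of these names come from MFC). One detail worth knowing: in standard C++ the compiler-generated temporary is destroyed at the end of the full expression that contains the call, so by the next statement it is already gone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical class that records how many objects are built and torn down.
struct Traced {
    static int constructed;
    static int destroyed;
    std::string text;
    Traced(const char* s) : text(s) { ++constructed; }        // converting constructor
    Traced(const Traced& o) : text(o.text) { ++constructed; }
    ~Traced() { ++destroyed; }
};
int Traced::constructed = 0;
int Traced::destroyed = 0;

// Accepts a temporary, because a const reference may bind to one.
std::size_t TakeConstRef(const Traced& t) { return t.text.size(); }

// A function declared as   void TakeRef(Traced& t);   could not be called
// with a const char*: a non-const reference cannot bind to a temporary.

struct TempCounts { int constructed; int destroyed; };

// Passes a raw pointer where a Traced is expected and reports what happened.
TempCounts CallWithRawPointer() {
    Traced::constructed = 0;
    Traced::destroyed = 0;
    const char* lp = "abc";
    (void)TakeConstRef(lp);   // the compiler builds a temporary Traced from lp
    TempCounts c = { Traced::constructed, Traced::destroyed };
    return c;
}
```

Calling CallWithRawPointer should report one construction and one destruction: the hidden temporary and nothing else.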

However you declare your arguments (CString or LPCSTR), you can still pass the other kind of argument in your code. Which is better? I'm getting there, I promise.

So far, I've only shown you what happens for converting function arguments. As you'd expect, the compiler works the same magic on return values. You can write

 LPCSTR lp;
CString cs;
lp = myobj.GetCString();   // type mismatch?
cs = myobj.GetLPCSTR();    // type mismatch?

and C++ works its gris-gris to make your code compile. In the first case, C++ converts the return value from const CString& to LPCSTR by invoking the conversion operator CString::operator LPCSTR. In the second case, the conversion is actually an assignment: C++ invokes CString::operator=(LPCSTR).

In all, there are eight cases to consider: four cases for Set and four cases for Get, depending on the type declared versus the type passed or assigned. In addition to the hidden conversions for arguments and return values, you also have to consider what happens inside your Set/Get functions. For example, if you write

 void MyClass::SetLPCSTR(LPCSTR lpsz)
{ 
      m_str = lpsz;
}

the innocent-looking assignment statement actually compiles into a call to CString::operator=(LPCSTR). Likewise, you have to consider what happens for SetCString, GetLPCSTR, and GetCString. Things are really getting out of hand here!

In an effort to get a handle on all this madness, I wrote a program, STRTEST.CPP (see Figure 2), that illustrates exactly what happens in each situation. STRTEST contains the improved MyClass with Set/Get functions for CString and LPCSTR and a main function that exercises each of the eight cases I mentioned. It also contains a stripped-down version of CString, with only the functions declared that are relevant to the discussion at hand. All functions are left outline (as opposed to inline) so you can see where the compiler generates function calls.

The idea is to compile STRTEST and look at the assembly code generated in the hopes of understanding what's really going on behind the veil of the compiler. This is the brute force investigative technique I mentioned at the outset. It's disgusting to look at, I know, but it's also amusing. Figure 3 shows the abridged assembly output for the main function, with my running commentary.

You'd think by now I would just come out and tell you the answer, but I've only described the type conversions generically. The next thing you have to do is look inside CString to see what all these operators and constructors actually do. Fortunately, this is a little more interesting. Consider the conversion operator for LPCSTR. I mentioned earlier that CString contains just one member, m_pchData, a char* that points to the actual character data, such as "Hello, world". Knowing this, you can probably guess how CString::operator LPCSTR is implemented.

 // (from afx.inl)
inline CString::operator LPCTSTR() const
{ 
      return m_pchData;    // just return ptr to string
}

Just like a typical Get function, all it does is return a data member. Since it's inline, converting a CString to LPCSTR is very fast. If you write

 SetLPCSTR(cs);       // cs is a CString

it gets compiled exactly as if you'd written

 SetLPCSTR(cs.m_pchData);

which you can't do because m_pchData is protected.

What about the other operators? Well, when I told you about m_pchData, I didn't tell you everything. It's true that m_pchData points to the underlying character string, but hidden behind the string is a little struct.

 struct CStringData {
      long nRefs;         // reference count
      int  nDataLength;   // length of string
      int  nAllocLength;  // length of buffer allocated
};

Figure 4 illustrates the situation. When CString allocates space for a new string, it adds a few extra bytes to store this header. CStringData contains vital information about the string. For example, CString::GetLength is implemented like this:

 // (from afx.inl)
inline int CString::GetLength() const
{ 
      return GetData()->nDataLength; 
}

GetData is another inline function:

 
inline CStringData* CString::GetData() const
{ 
      ASSERT(m_pchData != NULL); 
      return ((CStringData*)m_pchData)-1; 
}
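The pointer arithmetic in GetData is easier to believe once you try it outside MFC. The following is a rough sketch only: Header, AllocString, GetHeader, and FreeString are hypothetical names, and the real MFC allocation code is more involved. It allocates the header and the characters in one block, hands back a pointer to the characters, and recovers the header by stepping back one Header.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical analogue of CStringData, placed just before the characters.
struct Header {
    long nRefs;         // reference count
    int  nDataLength;   // length of string
    int  nAllocLength;  // length of buffer allocated
};

// Allocates header + characters as one block; returns a pointer to the characters.
char* AllocString(const char* src) {
    int len = static_cast<int>(std::strlen(src));
    Header* h = static_cast<Header*>(std::malloc(sizeof(Header) + len + 1));
    if (h == nullptr) return nullptr;
    h->nRefs = 1;
    h->nDataLength = len;
    h->nAllocLength = len;
    char* pch = reinterpret_cast<char*>(h + 1);   // characters start right after the header
    std::memcpy(pch, src, len + 1);               // copy the bytes and the terminator
    return pch;
}

// Same trick as GetData: cast to a header pointer and step back one element.
Header* GetHeader(char* pch) {
    assert(pch != nullptr);
    return reinterpret_cast<Header*>(pch) - 1;
}

// The block must be freed from its true start, which is the header.
void FreeString(char* pch) { std::free(GetHeader(pch)); }
```

The caller sees only an ordinary char*, which is why the pointer can be handed straight to anything that expects a C string.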

Figure 4 Anatomy of a CString

Why did the implementers of MFC put the CStringData information as a hidden block preceding the character data instead of storing it as class members in CString, which would be the obvious thing to do? Because it makes CStrings small and fast. Consider what happens when you copy a CString in either a copy constructor or an assignment from CString to CString. If all the information is stored in the CString, as it was before MFC 4.0, you'd have to copy it along with m_pchData, so there would be more things to copy. Plus, you can't just copy the value of m_pchData, you have to allocate a new buffer and copy the contents with a function like strcpy or memcpy.

Starting in release 4.0, MFC uses a different technique called "copy on modify" to copy CStrings. Commercial string libraries have long used this technique; MFC finally caught up. The basic idea is to copy only the pointer at first, and not actually copy the bytes until it becomes necessary. Figure 5 shows how it works.

Figure 5 CString Copy on Modify in Action!

Say you have a CString, cstr1, with a ref count of 1. Then suppose you make a copy of it.

 cstr2 = cstr1;

Instead of copying all the string information and character bytes, the assignment operator copies the pointer m_pchData and increments CStringData::nRefs. Now cstr1 and cstr2 actually point to the same object in memory, but nRefs is 2 instead of 1. This makes two CStrings, but just one byte array. What happens if the program subsequently alters either cstr1 or cstr2? No problem. Before modifying any CString, MFC checks the ref count. If it's greater than 1, some other CString is pointing to this same m_pchData so MFC can't change it. Instead, MFC allocates a new m_pchData with its own CStringData and copies the bytes. MFC decrements the ref count in the original object and sets the new ref count to 1. A similar thing happens when a CString is destroyed; only when the ref count drops to zero does MFC actually deallocate m_pchData. You can see that this strategy only works because the information about the string (CStringData) is kept with the string itself and CStrings are just pointers to these data/string objects. Figure 6 summarizes what all the relevant CString functions and operators do with regard to copying.
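For readers who want to watch the ref count move, here is a teaching sketch of copy on modify. CowString and its members are hypothetical and much simpler than MFC's code: the shared block holds a std::string rather than a raw buffer, and nothing here is thread-safe.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical copy-on-modify string; a sketch, not MFC's implementation.
class CowString {
    struct Rep {                       // shared block: ref count plus characters
        long nRefs;
        std::string chars;
        explicit Rep(const std::string& s) : nRefs(1), chars(s) {}
    };
    Rep* m_rep;

    // Drop our reference; free the block only when nobody else uses it.
    void Release() { if (--m_rep->nRefs <= 0) delete m_rep; }

    // Called before any modification: take a private copy if the block is shared.
    void CopyBeforeWrite() {
        assert(m_rep->nRefs >= 1);
        if (m_rep->nRefs > 1) {
            Rep* fresh = new Rep(m_rep->chars);
            --m_rep->nRefs;            // the original loses one user
            m_rep = fresh;             // we now own a block with nRefs == 1
        }
    }
public:
    explicit CowString(const char* s) : m_rep(new Rep(s)) {}
    CowString(const CowString& o) : m_rep(o.m_rep) { ++m_rep->nRefs; }   // quick copy
    CowString& operator=(const CowString& o) {
        if (m_rep != o.m_rep) { Release(); m_rep = o.m_rep; ++m_rep->nRefs; }
        return *this;
    }
    ~CowString() { Release(); }

    void Append(const char* s) { CopyBeforeWrite(); m_rep->chars += s; }
    const char* c_str() const { return m_rep->chars.c_str(); }
    long RefCount() const { return m_rep->nRefs; }
    bool SharesBufferWith(const CowString& o) const { return m_rep == o.m_rep; }
};
```

Copying a CowString and then calling Append on the copy should leave the two objects with separate buffers, each with a ref count of 1, and the original text untouched.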

Figure 6 CString Functions and Operators

CString function/operator	Costly?	What it does
CString::CString(const CString& cs)	No	Quick copy. Copy value of m_pchData and increment CStringData::nRefs.
CString::CString(LPCSTR lp)	Yes	Always allocate a new character array and CStringData. Copy bytes from lp.
CString::~CString()	No	Deallocate string only if --nRefs <= 0; that is, if this is the only CString using this particular m_pchData.
operator LPCSTR() const	No	Inline function just returns m_pchData. No function call.
const CString& operator=(const CString& cs)	No	Similar to copy constructor. Copy value of m_pchData and increment nRefs.
const CString& operator=(LPCSTR lp)	Yes	Similar to LPCSTR constructor. Always allocate a new character array and CStringData. Copy bytes from lp.

The whole point is that copying CStrings is now very fast since you just copy one pointer. You can pass CStrings around by value without paying a price. A typical application might have many functions with arguments declared CString, and you might pass the same CString by value from function A to function B to function C. Each call requires creating a copy of the CString on the stack. Before MFC 4.0, this would allocate and copy a new string every time! Copy on modify fixes this situation so only m_pchData is copied. As soon as one of the functions or some other part of the code attempts to modify the underlying string, MFC makes a new copy.

Remember, this only applies when you pass CStrings by value. If you use const references (const CString&), C++ passes a pointer to the actual CString and doesn't even call the copy constructor.

Finally, I'm in a position to answer the question! I could have made you wade through the assembler code, but I have some sympathy and did the dirty work myself. I compiled the results in two tables that summarize what happens in the eight different cases in the main function of STRTEST.CPP (see Figures 7 and 8).

And the winner is CString! If you want to maximize performance, you should declare your Set/Get functions using CString, not LPCSTR.

 class MyClass {
      CString m_str;
public:
      void  SetFoo(const CString& cs) { m_str = cs; }
      const CString& GetFoo()         { return m_str; }
};

Why? Well, it should be obvious from Figure 8 that CString is the way to go for the Get function. Case 6, where you return LPCSTR and then assign it to a CString, is the one to avoid because assigning an LPCSTR to a CString always allocates a new buffer.

The Set function is a little more subtle. At first glance, it seems like case 3 is really bad because it not only creates a temp variable, but the temp variable must be destroyed as well. When you look at it again, you realize that the m_pchData created for temp is immediately copied to m_str inside the Set function and has its ref count bumped up. But, when temp is subsequently destroyed, nothing happens because m_str is now using the same m_pchData that was originally created for temp! In other words, the underlying string is allocated only once and then handed to m_str, where it resides until m_str is destroyed. This is essentially the same overhead as case 4, only the allocation happens inside the Set function in operator=(LPCSTR). Having a Set function that takes LPCSTR doesn't really buy you anything. (There are a few extra pushes and pops associated with creating the temp variable, but that's negligible.)
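The claim that cases 3 and 4 cost the same can be checked with a rough model. The sketch below assumes a hypothetical ref-counted SharedStr and a Holder class with both kinds of Set function; g_allocs counts buffer allocations. It models only the sharing behavior described above and is not MFC code.

```cpp
#include <cassert>
#include <string>

static int g_allocs = 0;   // counts character-buffer allocations

// Hypothetical ref-counted string, modeling only what matters here.
class SharedStr {
    struct Rep {
        long nRefs;
        std::string chars;
        explicit Rep(const char* s) : nRefs(1), chars(s) { ++g_allocs; }
    };
    Rep* m_rep;
    void Release() {
        assert(m_rep->nRefs > 0);
        if (--m_rep->nRefs <= 0) delete m_rep;
    }
public:
    SharedStr(const char* s = "") : m_rep(new Rep(s)) {}               // always allocates
    SharedStr(const SharedStr& o) : m_rep(o.m_rep) { ++m_rep->nRefs; } // shares
    ~SharedStr() { Release(); }
    SharedStr& operator=(const SharedStr& o) {                         // shares
        if (m_rep != o.m_rep) { Release(); m_rep = o.m_rep; ++m_rep->nRefs; }
        return *this;
    }
    SharedStr& operator=(const char* s) {                              // always allocates
        Rep* fresh = new Rep(s);
        Release();
        m_rep = fresh;
        return *this;
    }
    const std::string& str() const { return m_rep->chars; }
};

class Holder {
    SharedStr m_str;
public:
    void SetCString(const SharedStr& cs) { m_str = cs; }   // declared with the class type
    void SetLPCSTR(const char* lp)       { m_str = lp; }   // declared with a raw pointer
    const SharedStr& Get() const { return m_str; }
};

// Case 3: Set takes the class type, caller passes a raw pointer (hidden temp).
int AllocsForCase3() { Holder h; g_allocs = 0; h.SetCString("hello"); return g_allocs; }

// Case 4: Set takes a raw pointer, caller passes a raw pointer.
int AllocsForCase4() { Holder h; g_allocs = 0; h.SetLPCSTR("hello"); return g_allocs; }
```

In this model both functions report a single allocation: in case 3 the temp's buffer is simply handed over to m_str, and in case 4 the one allocation happens inside the assignment.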

The moral of the story is, use const CString& in all your declarations. This makes sense; m_str is already a CString so why convert it? LPCSTRs will have to be converted one way or another, so let the compiler do it when necessary. If you convert the m_str to LPCSTR in your Set/Get functions, you'll only have to convert back again in the case where you have a CString. Phew!

Have a question about programming in C or C++? Send it to Paul DiLascia at 72400.2702@compuserve.com.

From the August 1996 issue of Microsoft Systems Journal.