C++/CLI Converting from String to wchar_t and to char*

I recently started working on a managed wrapper for the Terminal Services API, and as my C++/CLI is a bit rusty I ran into some issues which I'm sure are common when trying to handle the impedance mismatch between the managed and unmanaged worlds.

I'm going to take a look at one of those issues here: using System::String with native functions. The Win32 API is one such body of native functions, and its functions fairly consistently take LPTSTR (or LPCTSTR) parameters for strings. LPTSTR is a typedef for TCHAR*, and TCHAR in turn is a typedef for wchar_t in Unicode builds and char otherwise.
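The typedef switch described above can be sketched in portable C++. The names my_tchar, my_lptstr, my_lpctstr, MY_TEXT, my_tcslen and title_length are made up for illustration; real code gets TCHAR, LPTSTR, LPCTSTR and the TEXT macro from the Windows headers.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the Win32 names, selected by the UNICODE macro.
#ifdef UNICODE
typedef wchar_t my_tchar;                 // Unicode build: wide characters
#define MY_TEXT(s) L##s
inline std::size_t my_tcslen(const my_tchar* s) { return std::wcslen(s); }
#else
typedef char my_tchar;                    // non-Unicode build: narrow characters
#define MY_TEXT(s) s
inline std::size_t my_tcslen(const my_tchar* s) { return std::strlen(s); }
#endif

typedef my_tchar*       my_lptstr;        // plays the role of LPTSTR
typedef const my_tchar* my_lpctstr;       // plays the role of LPCTSTR

// A function written once against the switchable character type.
inline std::size_t title_length(my_lpctstr title) {
    return my_tcslen(title);
}
```

The caller writes title_length(MY_TEXT("Hello")) in both kinds of build, and the preprocessor picks the character type.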

Toll-free bridge with CString

One easy and automatic way to do this conversion is to bridge it through CString, a type that ships with Microsoft's ATL. I believe it was originally part of MFC, but it has since been decoupled from the rest of MFC, and even from ATL proper and its server classes; include atlstr.h instead of cstringt.h.

The character type of CString is based on TCHAR as well, so its internal representation is either wchar_t or char depending on whether you are doing a Unicode build. The CLI has no such differentiation: String is always a wide-character string. This means that Unicode builds need no conversion but non-Unicode builds do, and CString takes care of this for us through its helpful conversion constructor that accepts a System::String^.


template <class SystemString>
CStringT( SystemString^ pString ) :
    CThisSimpleString( StringTraits::GetDefaultManager() )
{
    cli::pin_ptr<const System::Char> pChar = PtrToStringChars( pString );
    const wchar_t *psz = pChar;
    *this = psz;
}

PtrToStringChars retrieves a pointer to the String's internal memory buffer, so there is no copy here. The pointer returned is then pinned so that the Garbage Collector will not move the buffer while we use it. The constructor then uses an implicit conversion to go from pin_ptr<const System::Char> to const wchar_t*. Finally it uses the copy assignment operator of CString (which delegates to a base class operator) to copy the contents of the String buffer into itself. In a Unicode build this copy assignment operator decays to a basic memory copy; otherwise a different copy assignment operator is invoked that uses WideCharToMultiByte to convert to the CString's internal char type.


// This gets called when the right-hand-side pszSrc is the same char type as the internal storage
// It simply delegates to its base class to copy the source buffer into itself
CStringT& operator=( _In_opt_z_ PCXSTR pszSrc )
{
    CThisSimpleString::operator=( pszSrc );

    return( *this );
}

// This gets called when the right-hand-side pszSrc has a different char type than the internal storage
// It converts to the internal storage char type directly into its internal buffer
CStringT& operator=( _In_opt_z_ PCYSTR pszSrc )
{
    // nDestLength is in XCHARs
    int nDestLength = (pszSrc != NULL) ? StringTraits::GetBaseTypeLength( pszSrc ) : 0;
    if( nDestLength > 0 )
    {
        PXSTR pszBuffer = GetBuffer( nDestLength );
        StringTraits::ConvertToBaseType( pszBuffer, nDestLength, pszSrc);
        ReleaseBufferSetLength( nDestLength );
    }
    else
    {
        Empty();
    }

    return( *this );
}

static void ConvertToBaseType(_Out_cap_(nDestLength) _CharType* pszDest, _In_ int nDestLength,
    _In_count_(nSrcLength) const wchar_t* pszSrc, _In_ int nSrcLength = -1) throw()
{
    // nLen is in XCHARs
    ::WideCharToMultiByte(_AtlGetConversionACP(), 0, pszSrc, nSrcLength, pszDest, nDestLength, NULL, NULL);
}

Now we can pass the CString wherever an LPCTSTR parameter is expected, because it has a user-defined conversion operator that simply returns the guts of the CString. This automatically does the right thing when calling functions in both Unicode and non-Unicode builds, all with no #ifdefs.


operator PCXSTR() const throw()
{ return( m_pszData ); }

PCXSTR is a chain of typedefs that eventually, through TCHAR, resolves to a const wchar_t* in a Unicode build (or a const char* otherwise). PCXSTR means "pointer to a const null-terminated string of the same char type I am"; there are also PXSTR and XCHAR typedefs. In addition to the X typedefs, CString has a set of Y typedefs (PCYSTR, PYSTR, YCHAR) that map to the opposite of the X typedefs: if X is wchar_t then Y is char. It uses the Y typedefs to declare a number of copy assignment operators and conversion constructors that convert from either character type to the internal type.

Conversion to Multibyte UTF8

There is one problem that this doesn't cover, however: how to convert your Unicode String into a const char* for use as a parameter to functions that don't have Unicode counterparts. Stan Lippman has a blog post from back in 2004 where he presents a couple of functions to handle this conversion to char* and to std::string. We'll see some similarities with the CString constructors.


bool ToCharStar( String^ source, char*& target )
{
    pin_ptr<const wchar_t> wch = PtrToStringChars( source );
    int len = (( source->Length+1) * 2);
    target = new char[ len ];
    return wcstombs( target, wch, len ) != -1;
}

bool Tostring( String^ source, string &target )
{
    pin_ptr<const wchar_t> wch = PtrToStringChars( source );
    int len = (( source->Length+1) * 2);
    char *ch = new char[ len ];
    bool result = wcstombs( ch, wch, len ) != -1;
    target = ch;
    delete ch;
    return result;
}

There are a few problems with these functions. First, they are not reentrant and hence not thread safe: wcstombs keeps a global internal state during the conversion of a string. Second, they size the buffer at two bytes per UTF-16 code unit, which String always uses, so they over-allocate for simple single-byte character sets like ASCII. Third, the std::string converter creates an unnecessary temporary buffer. And the char* converter puts the onus of freeing the buffer on the caller, which is doubly dangerous here: because the function allocates with new[], the caller needs to know to call delete[].
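As a small illustration of these points, the sketch below uses the restartable wcsrtombs with a caller-owned mbstate_t, asks it for the exact byte count first, and converts directly into a std::string so nothing has to be freed by hand. narrow_copy is a hypothetical helper name, not something from Lippman's post, and the output encoding is whatever the current C locale dictates.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a null-terminated wide string to the
// multibyte encoding of the current C locale.
inline std::string narrow_copy(const wchar_t* src) {
    if (!src) return std::string();

    // First pass: with a null destination, wcsrtombs only reports how many
    // bytes the converted string needs (terminator excluded).
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* cursor = src;
    std::size_t needed = std::wcsrtombs(NULL, &cursor, 0, &state);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("wcsrtombs: character not representable");

    // Second pass: convert straight into the string's own buffer.
    std::string out(needed, '\0');
    if (needed > 0) {
        state = std::mbstate_t();   // fresh state, local to this call
        cursor = src;
        std::wcsrtombs(&out[0], &cursor, needed, &state);
    }
    return out;
}
```

The two-pass approach trades a second scan of the input for an exactly sized allocation; the conversion state lives on the stack, so concurrent calls do not interfere with each other.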

Let's take a whack at implementing these functions in the style of the C++ standard library while solving these deficiencies (thanks to Kniht on freenode ##C++ for helping me distill this).

#include <limits.h>
#include <wchar.h>
#include <algorithm>
#include <iterator>
#include <stdexcept>
#include <string>

struct ConversionError : std::runtime_error {
    ConversionError() : std::runtime_error("ConversionError") {}

    explicit ConversionError(std::string const& what) : std::runtime_error("ConversionError: " + what) {}

protected:
    struct NoPrefix {};

    ConversionError(NoPrefix, std::string const& what) : std::runtime_error(what) {}
};

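// Output iterator that converts each assigned wchar_t to its multibyte
// form and forwards the resulting bytes to the wrapped iterator.
template<class OutIter>
struct mb_output_t : std::iterator<std::output_iterator_tag, void, void, void, void> {
    mb_output_t(OutIter out) : _mbstate(), _out(out) {}

    // Increment and dereference are no-ops; the work happens on assignment.
    mb_output_t& operator++() { return *this; }
    mb_output_t& operator++(int) { return *this; }
    mb_output_t& operator*() { return *this; }

    void operator=(wchar_t wc) {
        char buf[MB_LEN_MAX];
        // wcrtomb returns a size_t and reports failure as (size_t)-1
        size_t len = ::wcrtomb(buf, wc, &_mbstate);
        if (len == static_cast<size_t>(-1)) {
            throw ConversionError("wcrtomb");
        }
        _out = std::copy(buf, buf + len, _out);
    }

    mbstate_t _mbstate;   // per-iterator conversion state keeps this reentrant
    OutIter _out;
};

template<class OutIter>
mb_output_t<OutIter> mb_output(OutIter out) {
    return mb_output_t<OutIter>(out);
}

template<class Cont>
mb_output_t<std::back_insert_iterator<Cont> > mb_back_inserter(Cont& c) {
    return mb_output(std::back_inserter(c));
}

// Appends the multibyte form of s to c; the null terminator is not copied.
template<class Cont>
void wcstomb(Cont& c, wchar_t const* s) {
    if (s) {
        std::copy(s, s + wcslen(s), mb_back_inserter(c));
    }
}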

This code uses only the standard library, so it should work on any standards-compliant compiler and OS; keep in mind that wcrtomb converts to the multibyte encoding of the current C locale. In our examples, however, the source is a System::String and there are two targets we're interested in: std::string and char*. We can reach each of these targets easily using std::copy with our custom mb_back_inserter.


#include <iostream>
#include <string>
#include <vector>
#include <vcclr.h>   // PtrToStringChars

using namespace System;

template<class Cont>
void String_to_mb(Cont& c, System::String^ source) {
    // Pin the String's buffer so the GC cannot move it during the conversion
    pin_ptr<const wchar_t> wch = ::PtrToStringChars( source );
    wcstomb(c, wch);
}

void ctest(const char* str) {
    std::cout << str << std::endl;
}

void stest(const std::string& str) {
    std::cout << str << std::endl;
}

void convtest() {
    String^ sstr = L"Hello!";

    std::vector<char> charStar;
    String_to_mb(charStar, sstr);
    // wcstomb does not copy the terminator, so add one before using the buffer as a C string
    charStar.push_back('\0');
    // vector<char> can be used as a char* by passing &charStar[0] to a function taking char*
    ctest(&charStar[0]);

    std::string s;
    String_to_mb(s, sstr);
    // the std::string can be used as-is or it can also produce a char* by calling string::c_str()
    stest(s);
    ctest(s.c_str());

    // the vector and string are automatically deallocated leaving this scope
}

Making effective use of containers, iterators, and wcrtomb(...) we've created a solution that doesn't require caller deallocation and that is reentrant.