Sorting It All Out

Uniscribe from Visual Basic

Over the past six weeks I have received eight different requests for information on how to get various bits of Uniscribe to work from Visual Basic, in VB5 or VB6.

Although there is sample code to do this in my book, the book is out of print now (even though it was released less than six years ago, it is also three product versions ago, and the publisher seems to prefer to focus on newer stuff).

In any case, here is the code (mostly from the book) that is used to call Uniscribe from VB in the "Light Edit" control sample, which is a modified version of the one originally created by Matt Curland for his book (Advanced Visual Basic 6: Power Techniques for Everyday Programs).

I included VB-ized versions of some Uniscribe structs that are not used by the UniscribeExtTextOutW wrapper that most of the code is there to support. Perhaps someone will find them useful anyway, so I left them in.

I had mostly written a .NET version of the code but then in VS 2005 the TextRenderer class I talked about here and here made it less necessary -- since it is now built-in to .NET....

If you want an in-depth explanation of everything I did for the light edit control to support Unicode input, clipboard, and rendering support, you'll have to find the book (I am hoarding my last copies, sorry!). But here is a nice bit of the Uniscribe sample....

'--------------------------------
'   Windows API enumerations
Public Enum GCPCLASS
     GCPCLASS_LATIN = 1
     GCPCLASS_ARABIC = 2
     GCPCLASS_HEBREW = 2
     GCPCLASS_NEUTRAL = 3
     GCPCLASS_LOCALNUMBER = 4
     GCPCLASS_LATINNUMBER = 5
     GCPCLASS_LATINNUMERICTERMINATOR = 6
     GCPCLASS_LATINNUMERICSEPARATOR = 7
     GCPCLASS_NUMERICSEPARATOR = 8
     GCPCLASS_POSTBOUNDRTL = &H10
     GCPCLASS_PREBOUNDLTR = &H80
     GCPCLASS_PREBOUNDRTL = &H40
     GCPCLASS_POSTBOUNDLTR = &H20
     GCPGLYPH_LINKAFTER = &H4000
     GCPGLYPH_LINKBEFORE = &H8000&   ' Trailing & keeps this a Long (32768); a bare &H8000 is the Integer -32768
End Enum

'--------------------------------
' Windows API types
Public Type ABC
    abcA As Long
    abcB As Long
    abcC As Long
End Type

'--------------------------------
' Uniscribe ENUMs
Public Enum SCRIPT
    SCRIPT_UNDEFINED = 0
End Enum

Public Enum SCRIPT_JUSTIFY
    SCRIPT_JUSTIFY_NONE = 0
    SCRIPT_JUSTIFY_ARABIC_BLANK = 1
    SCRIPT_JUSTIFY_CHARACTER = 2
    SCRIPT_JUSTIFY_RESERVED1 = 3
    SCRIPT_JUSTIFY_BLANK = 4
    SCRIPT_JUSTIFY_RESERVED2 = 5
    SCRIPT_JUSTIFY_RESERVED3 = 6
    SCRIPT_JUSTIFY_ARABIC_NORMAL = 7
    SCRIPT_JUSTIFY_ARABIC_KASHIDA = 8
    SCRIPT_JUSTIFY_ARABIC_ALEF = 9
    SCRIPT_JUSTIFY_ARABIC_HA = 10
    SCRIPT_JUSTIFY_ARABIC_RA = 11
    SCRIPT_JUSTIFY_ARABIC_BA = 12
    SCRIPT_JUSTIFY_ARABIC_BARA = 13
    SCRIPT_JUSTIFY_ARABIC_SEEN = 14
    SCRIPT_JUSTIFY_RESERVED4 = 15
End Enum

Public Enum SSA_FLAGS
    SSA_PASSWORD = &H1            ' Input string contains a single character to be duplicated iLength times
    SSA_TAB = &H2                 ' Expand tabs
    SSA_CLIP = &H4                ' Clip string at iReqWidth
    SSA_FIT = &H8                 ' Justify string to iReqWidth
    SSA_DZWG = &H10               ' Provide representation glyphs for control characters
    SSA_FALLBACK = &H20           ' Use fallback fonts
    SSA_BREAK = &H40              ' Return break flags (character and word stops)
    SSA_GLYPHS = &H80             ' Generate glyphs, positions and attributes
    SSA_RTL = &H100               ' Base embedding level 1
    SSA_GCP = &H200               ' Return missing glyphs and LogClust with GetCharacterPlacement conventions
    SSA_HOTKEY = &H400            ' Replace '&' with underline on subsequent codepoint
    SSA_METAFILE = &H800          ' Write items with ExtTextOutW Unicode calls, not glyphs
    SSA_LINK = &H1000             ' Apply FE font linking/association to non-complex text
    SSA_HIDEHOTKEY = &H2000       ' Remove first '&' from displayed string
    SSA_HOTKEYONLY = &H2400       ' Display underline only.

    ' Internal flags
    SSA_PIDX = &H10000000         ' Internal
    SSA_LAYOUTRTL = &H20000000    ' Internal - Used when DC is mirrored
    SSA_DONTGLYPH = &H40000000    ' Internal - Used only by GDI during metafiling - Use ExtTextOutA for positioning
End Enum

Public Enum SCRIPT_IS_COMPLEX_FLAGS
    SIC_COMPLEX = 1      ' Treat complex script letters as complex
    SIC_ASCIIDIGIT = 2   ' Treat digits U+0030 through U+0039 as complex
    SIC_NEUTRAL = 4      ' Treat neutrals as complex
End Enum

Public Enum SCRIPT_DIGITSUBSTITUTE_FLAGS
    SCRIPT_DIGITSUBSTITUTE_CONTEXT = 0       ' Substitute to match preceding letters
    SCRIPT_DIGITSUBSTITUTE_NONE = 1          ' No substitution
    SCRIPT_DIGITSUBSTITUTE_NATIONAL = 2      ' Substitute with official national digits
    SCRIPT_DIGITSUBSTITUTE_TRADITIONAL = 3   ' Substitute with traditional digits of the locale
End Enum

Public Enum SCRIPT_GET_CMAP_FLAGS
    SGCM_RTL = &H1&             ' Return mirrored glyph for mirrorable Unicode codepoints
End Enum

'--------------------------------
'   Uniscribe Types

' This is the C-friendly version of SCRIPT_DIGITSUBSTITUTE_VB
' which will be packed properly
Public Type SCRIPT_DIGITSUBSTITUTE
    NationalDigitLanguage As Integer
    TraditionalDigitLanguage As Integer
    DigitSubstitute As Byte
    dwReserved As Long
End Type

' This is the C-friendly version of SCRIPT_CONTROL_VB
' which will be packed properly
Public Type SCRIPT_CONTROL
    uDefaultLanguage As Integer
    fBitFields As Byte
    fReserved As Integer
End Type

' This is the C-friendly version of SCRIPT_STATE_VB
' which will be packed properly
Public Type SCRIPT_STATE
    fBitFields1 As Byte
    fBitFields2 As Byte
End Type

' This is the C-friendly version of SCRIPT_VISATTR_VB
' which will be packed properly
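Public Type SCRIPT_VISATTR
    ' The native struct is a single 16-bit WORD, so uJustification (:4)
    ' lives in the low bits of fBitFields1 rather than in its own field
    fBitFields1 As Byte
    fBitFields2 As Byte
End Type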

' This is the C-friendly version of SCRIPT_ANALYSIS_VB
' which will be packed properly
Public Type SCRIPT_ANALYSIS
    fBitFields1 As Byte
    fBitFields2 As Byte
    s As SCRIPT_STATE
End Type

' This is the C-friendly version of SCRIPT_LOGATTR_VB
' which will be packed properly
Public Type SCRIPT_LOGATTR
    fBitFields As Byte
End Type

Public Type SCRIPT_CACHE
    p As Long
End Type

Public Type SCRIPT_FONTPROPERTIES
    cBytes As Long
    wgBlank As Integer
    wgDefault As Integer
    wgInvalid As Integer
    wgKashida As Integer
    iKashidaWidth As Long
End Type

' UNDONE: This structure may not work well
' for using SCRIPT_PROPERTIES because it may
' not be aligned properly. Why oh why did they
' have to use bitfields?
Public Type SCRIPT_PROPERTIES
    langid As Integer
    fBitFields(1 To 3) As Byte
End Type

Public Type SCRIPT_ITEM
    iCharPos As Long
    a As SCRIPT_ANALYSIS
End Type

Public Type GOFFSET
    du As Long
    dv As Long
End Type

Public Type SCRIPT_TABDEF
    cTabStops As Long
    iScale As Long
    pTabStops() As Long
    iTabOrigin As Long
End Type

' We do not use this struct since we have to pass it ByVal
' sometimes and ByRef other times. All it is is a pointer to a
' BLOB of data in memory, anyway, so we will use a Long
Public Type SCRIPT_STRING_ANALYSIS
    p As Long
End Type

'--------------------------------
'   VB friendly versions of Uniscribe Types

' You will have to use SCRIPT_CONTROL to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_CONTROL_VB
    uDefaultLanguage As Long    ' :16
    fContextDigits As Byte ' As Long   :1
    fInvertPreBoundDir As Byte ' As Long   :1
    fInvertPostBoundDir As Byte ' As Long   :1
    fLinkStringBefore As Byte   ' As Long   :1
    fLinkStringAfter As Byte    ' As Long   :1
    fNeutralOverride As Byte    ' As Long   :1
    fNumericOverride As Byte    ' As Long   :1
    fLegacyBidiClass As Byte    ' As Long   :1
    fReserved As Byte   ' As Long   :8
End Type

' You will have to use SCRIPT_STATE to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_STATE_VB
    uBidiLevel As Integer   ':5
    fOverrideDirection As Integer   ':1
    fInhibitSymSwap As Integer ':1
    fCharShape As Integer   ':1
    fDigitSubstitute As Integer ':1
    fInhibitLigate As Integer   ':1
    fDisplayZWG As Integer ':1
    fArabicNumContext As Integer    ':1
    fGcpClusters As Integer ':1
    fReserved As Integer    ':1
    fEngineReserved As Integer ':2
End Type

' You will have to use SCRIPT_VISATTR to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_VISATTR_VB
       uJustification As SCRIPT_JUSTIFY ':4
       fClusterStart As Integer ':1
       fDiacritic As Integer    ':1
       fZeroWidth As Integer    ':1
       fReserved As Integer ':1
       fShapeReserved As Integer    ':8
End Type

' You will have to use SCRIPT_ANALYSIS to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_ANALYSIS_VB
    eScript As Integer ':10
    fRTL As Integer ':1
    fLayoutRTL As Integer   ':1
    fLinkBefore As Integer ':1
    fLinkAfter As Integer   ':1
    fLogicalOrder As Integer    ':1
    fNoGlyphIndex As Integer    ':1
    s As SCRIPT_STATE
End Type

' You will have to use SCRIPT_LOGATTR to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_LOGATTR_VB
    fSoftBreak As Byte ':1
    fWhiteSpace As Byte ':1
    fCharStop As Byte   ':1
    fWordStop As Byte   ':1
    fInvalid As Byte    ':1
    fReserved As Byte   ':3
End Type

' You will have to use SCRIPT_PROPERTIES to call the
' API to make sure the structure is packed properly
Public Type SCRIPT_PROPERTIES_VB
    langid As Long ':16
    fNumeric As Long    ':1
    fComplex As Long    ':1
    fNeedsWordBreaking As Long ':1
    fNeedsCaretInfo As Long ':1
    bCharSet As Long    ':8
    fControl As Long    ':1
    fPrivateUseArea As Long ':1
    fNeedsCharacterJustify As Long ':1
    fInvalidGlyph As Long   ':1
    fInvalidLogAttr As Long ':1
    fCDM As Long    ':1

    ' Added in later versions of UNISCRIBE (usp10.h)
    fAmbiguousCharSet As Long   ':1
    fClusterSizeVaries As Long ':1
    fRejectInvalid As Long ':1
End Type

'--------------------------------
'   Uniscribe APIs
Declare Function ScriptApplyDigitSubstitution Lib "usp10.dll" ( _
psds As SCRIPT_DIGITSUBSTITUTE, _
psc As SCRIPT_CONTROL, _
pss As SCRIPT_STATE _
) As Long

Declare Function ScriptApplyLogicalWidth Lib "usp10.dll" ( _
piDx() As Long, _
ByVal cChars As Long, _
ByVal cGlyphs As Long, _
pwLogClust() As Integer, _
psva As SCRIPT_VISATTR, _
piAdvance() As Long, _
pSA As SCRIPT_ANALYSIS, _
pABC As ABC, _
piJustify As Long _
) As Long

Declare Function ScriptBreak Lib "usp10.dll" ( _
pwcChars As Long, _
ByVal cChars As Long, _
pSA As SCRIPT_ANALYSIS, _
psla As SCRIPT_LOGATTR _
) As Long

Declare Function ScriptCPtoX Lib "usp10.dll" ( _
ByVal iCP As Long, _
ByVal fTrailing As Long, _
ByVal cChars As Long, _
ByVal cGlyphs As Long, _
pwLogClust As Integer, _
psva As SCRIPT_VISATTR, _
piAdvance As Long, _
pSA As SCRIPT_ANALYSIS, _
piX As Long _
) As Long

Declare Function ScriptCacheGetHeight Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
tmHeight As Long _
) As Long

Declare Function ScriptFreeCache Lib "usp10.dll" ( _
psc As SCRIPT_CACHE _
) As Long

Declare Function ScriptGetCMap Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
ByVal pwcInChars As Long, _
ByVal cChars As Long, _
ByVal dwFlags As SCRIPT_GET_CMAP_FLAGS, _
pwOutGlyphs() As Integer _
) As Long

Declare Function ScriptGetFontProperties Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
sfp As SCRIPT_FONTPROPERTIES _
) As Long

Declare Function ScriptGetGlyphABCWidth Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
ByVal wGlyph As Integer, _
pABC As ABC _
) As Long

Declare Function ScriptGetLogicalWidths Lib "usp10.dll" ( _
pSA As SCRIPT_ANALYSIS, _
ByVal cChars As Long, _
ByVal cGlyphs As Long, _
piGlyphWidth() As Long, _
pwLogClust() As Integer, _
psva As SCRIPT_VISATTR, _
piDx As Long _
) As Long

Declare Function ScriptGetProperties Lib "usp10.dll" ( _
ppSp As SCRIPT_PROPERTIES, _
piNumScripts As Long _
) As Long

Declare Function ScriptIsComplex Lib "usp10.dll" ( _
ByVal pwcInChars As Long, _
ByVal cInChars As Long, _
ByVal dwFlags As SCRIPT_IS_COMPLEX_FLAGS _
) As Long

Declare Function ScriptItemize Lib "usp10.dll" ( _
ByVal pwcInChars As Long, _
ByVal cInChars As Long, _
ByVal cMaxItems As Long, _
psControl As SCRIPT_CONTROL, _
psState As SCRIPT_STATE, _
pItems() As SCRIPT_ITEM, _
pcItems As Long _
) As Long

Declare Function ScriptJustify Lib "usp10.dll" ( _
psva As SCRIPT_VISATTR, _
piAdvance() As Long, _
ByVal cGlyphs As Long, _
ByVal iDx As Long, _
ByVal iMinKashida As Long, _
piJustify() As Long _
) As Long

Declare Function ScriptLayout Lib "usp10.dll" ( _
ByVal cRuns As Long, _
pbLevel() As Byte, _
piVisualToLogical() As Long, _
piLogicalToVisual() As Long _
) As Long

Declare Function ScriptPlace Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
pwGlyphs() As Integer, _
ByVal cGlyphs As Long, _
psva As SCRIPT_VISATTR, _
pSA As SCRIPT_ANALYSIS, _
piAdvance() As Long, _
pGoffset As GOFFSET, _
pABC As ABC _
) As Long

Declare Function ScriptRecordDigitSubstitution Lib "usp10.dll" ( _
ByVal Locale As Long, _
psds As SCRIPT_DIGITSUBSTITUTE _
) As Long

Declare Function ScriptShape Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
ByVal pwcChars As Long, _
ByVal cChars As Long, _
ByVal cMaxGlyphs As Long, _
pas As SCRIPT_ANALYSIS, _
pwOutGlyphs() As Integer, _
pwLogClust() As Integer, _
psva As SCRIPT_VISATTR, _
pcGlyphs As Long _
) As Long

Declare Function ScriptTextOut Lib "usp10.dll" ( _
ByVal hdc As Long, _
psc As SCRIPT_CACHE, _
ByVal x As Long, _
ByVal y As Long, _
ByVal fuOptions As ETOFlags, _
lprc As RECT, _
pSA As SCRIPT_ANALYSIS, _
ByVal pwcReserved As Long, _
ByVal iReserved As Long, _
pwGlyphs() As Integer, _
ByVal cGlyphs As Long, _
piAdvance() As Long, _
piJustify As Any, _
pGoffset As GOFFSET _
) As Long

Declare Function ScriptXtoCP Lib "usp10.dll" ( _
ByVal iX As Long, _
ByVal cChars As Long, _
ByVal cGlyphs As Long, _
pwLogClust() As Integer, _
psva As SCRIPT_VISATTR, _
piAdvance() As Long, _
pSA As SCRIPT_ANALYSIS, _
piCP As Long, _
piTrailing As Long _
) As Long

'--------------------------------
'   Uniscribe ScriptString* APIs
Declare Function ScriptStringAnalyse Lib "usp10.dll" ( _
ByVal hdc As Long, _
ByVal pString As Long, _
ByVal cString As Long, _
ByVal cGlyphs As Long, _
ByVal iCharset As Long, _
ByVal dwFlags As SSA_FLAGS, _
ByVal iReqWidth As Long, _
ByRef psControl As Any, _
ByRef psState As Any, _
ByRef piDx As Long, _
ByRef pTabdef As Any, _
ByRef pbInClass As GCPCLASS, _
ByRef pssa As Long _
) As Long

Declare Function ScriptStringCPtoX Lib "usp10.dll" ( _
ByVal ssa As Long, _
ByVal iCP As Long, _
ByVal fTrailing As Long, _
pX As Long _
) As Long

Declare Function ScriptStringFree Lib "usp10.dll" ( _
ByRef pssa As Long _
) As Long

Declare Function ScriptStringGetLogicalWidths Lib "usp10.dll" ( _
ByVal ssa As Long, _
piDx() As Long _
) As Long

Declare Function ScriptStringGetOrder Lib "usp10.dll" ( _
ByVal ssa As Long, _
puOrder As Long _
) As Long

Declare Function ScriptStringOut Lib "usp10.dll" ( _
ByVal ssa As Long, _
ByVal iX As Long, _
ByVal iY As Long, _
ByVal uOptions As ETOFlags, _
prc As RECT, _
ByVal iMinSel As Long, _
ByVal iMaxSel As Long, _
ByVal fDisabled As BOOL _
) As Long

Declare Function ScriptString_pcOutChars Lib "usp10.dll" ( _
ByVal ssa As Long _
) As Long

Declare Function ScriptString_pLogAttr Lib "usp10.dll" ( _
ByVal ssa As Long _
) As Long

Declare Function ScriptString_pSize Lib "usp10.dll" ( _
ByVal ssa As Long _
) As Long

Declare Function ScriptStringValidate Lib "usp10.dll" ( _
ByVal ssa As Long _
) As Long

Declare Function ScriptStringXtoCP Lib "usp10.dll" ( _
ByVal ssa As Long, _
ByVal iX As Long, _
piCh As Long, _
piTrailing As Long _
) As Long

'---------------------
'   Wrappers around several Uniscribe functions that allow slightly
'   more friendly VB interaction
'
'   ScriptStringFreeC
'   ScriptString_pcOutCharsC
'   ScriptString_pSizeC
'   ScriptString_pLogAttrC
'   ScriptStringAnalyseC
'   ScriptStringCPtoXC
'   ScriptStringXtoCPC
'
'   ScriptIsComplexC
'   ScriptRecordDigitSubstitutionC
'---------------------
Public Function ScriptStringFreeC(ssa As Long) As Long
    If ssa <> 0 Then
        ScriptStringFreeC = ScriptStringFree(ssa)
        ssa = 0&
    End If
End Function

Public Function ScriptString_pcOutCharsC(ssa As Long) As Long
Dim pcch As Long
    pcch = ScriptString_pcOutChars(ssa)
    If pcch <> 0 Then
        CopyMemory ScriptString_pcOutCharsC, ByVal pcch, Len(pcch)
    End If
End Function

Public Function ScriptString_pSizeC(ssa As Long) As OleTypes.Size
Dim psiz As Long
    psiz = ScriptString_pSize(ssa)
    If psiz <> 0 Then
        CopyMemory ScriptString_pSizeC, ByVal psiz, Len(ScriptString_pSizeC)
    End If
End Function

Public Sub ScriptString_pLogAttrC(ssa As Long, cch As Long, rgsla() As SCRIPT_LOGATTR_VB)
Dim prgtsla As Long
Dim rgtsla() As SCRIPT_LOGATTR
Dim itsla As Long
Dim byt As Byte

    ' Call Uniscribe to get the LogAttr info
    prgtsla = ScriptString_pLogAttr(ssa)

    If prgtsla <> 0 Then
        ' Success! Let's put the pointer into a struct and prepare some memory
        ReDim rgtsla(0 To cch - 1)
        CopyMemory rgtsla(0), ByVal prgtsla, CLng(Len(rgtsla(0)) * cch)
        ReDim rgsla(0 To cch - 1)

        ' Convert the unfriendly C type into a friendly VB type that can be used elsewhere
        For itsla = 0 To cch - 1
            byt = rgtsla(itsla).fBitFields
            With rgsla(itsla)
                .fSoftBreak = RightShift((byt And &H1), 0)
                .fWhiteSpace = RightShift((byt And &H2), 1)
                .fCharStop = RightShift((byt And &H4), 2)
                .fWordStop = RightShift((byt And &H8), 3)
                .fInvalid = RightShift((byt And &H10), 4)
                .fReserved = RightShift((byt And &HE0), 5) ' &HE0 = (2 ^ 5 Or 2 ^ 6 Or 2 ^ 7)
            End With
        Next itsla
        Erase rgtsla
    End If
End Sub

Public Function ScriptStringAnalyseC( _
hdc As Long, stAnalyse As String, cch As Long, _
ByVal dwFlags As SSA_FLAGS, iReqWidth As Long, _
Optional vSCV As Variant, Optional vSSV As Variant, _
Optional vST As Variant) As Long
Dim ssa As Long
Dim sc As SCRIPT_CONTROL
Dim ss As SCRIPT_STATE
Dim st As SCRIPT_TABDEF
    If Not IsMissing(vSCV) Then
        sc.uDefaultLanguage = vSCV.uDefaultLanguage
        sc.fBitFields = _
                            LeftShift(vSCV.fContextDigits, 0) Or _
                            LeftShift(vSCV.fInvertPreBoundDir, 1) Or _
                            LeftShift(vSCV.fInvertPostBoundDir, 2) Or _
                            LeftShift(vSCV.fLinkStringBefore, 3) Or _
                            LeftShift(vSCV.fLinkStringAfter, 4) Or _
                            LeftShift(vSCV.fNeutralOverride, 5) Or _
                            LeftShift(vSCV.fNumericOverride, 6) Or _
                            LeftShift(vSCV.fLegacyBidiClass, 7)
    End If

    If Not IsMissing(vSSV) Then
        ss.fBitFields1 = _
                            LeftShift(vSSV.uBidiLevel And &H1F, 0) Or _
                            LeftShift(vSSV.fOverrideDirection, 5) Or _
                            LeftShift(vSSV.fInhibitSymSwap, 6) Or _
                            LeftShift(vSSV.fCharShape, 7)
        ss.fBitFields2 = _
                            LeftShift(vSSV.fDigitSubstitute, 0) Or _
                            LeftShift(vSSV.fInhibitLigate, 1) Or _
                            LeftShift(vSSV.fDisplayZWG, 2) Or _
                            LeftShift(vSSV.fArabicNumContext, 3) Or _
                            LeftShift(vSSV.fGcpClusters, 4)
    End If

    If Not IsMissing(vST) And ((dwFlags And SSA_TAB) = SSA_TAB) Then
        st.cTabStops = vST.cTabStops
        st.iScale = vST.iScale
        st.pTabStops = vST.pTabStops
        st.iTabOrigin = vST.iTabOrigin
    End If

    If ScriptStringAnalyse(hdc, StrPtr(stAnalyse), cch, 0, -1, dwFlags, iReqWidth, sc, ss, ByVal 0&, st, ByVal 0&, ssa) = S_OK Then
        ScriptStringAnalyseC = ssa
    End If
End Function

Public Function ScriptStringCPtoXC(ssa As Long, iCP As Long, fTrailing As BOOL) As Long
Dim pX As Long
    If ScriptStringCPtoX(ssa, iCP, fTrailing, pX) = S_OK Then
        ScriptStringCPtoXC = pX
    End If
End Function

Public Function ScriptStringXtoCPC(ssa As Long, ByVal iX As Long, piTrailing As BOOL) As Long
Dim piCh As Long
    If ScriptStringXtoCP(ssa, iX, piCh, piTrailing) = S_OK Then
        ScriptStringXtoCPC = piCh
    End If
End Function

Public Function ScriptIsComplexC(stIn As String, Optional Flags As SCRIPT_IS_COMPLEX_FLAGS) As Boolean
Dim hr As Long

    hr = ScriptIsComplex(StrPtr(stIn), Len(stIn), Flags)
    If hr = S_OK Then
        ScriptIsComplexC = True
    ElseIf hr = S_FALSE Then
        ScriptIsComplexC = False
    Else
        Err.Raise hr
    End If
End Function

Public Function ScriptRecordDigitSubstitutionC(Locale As Long) As SCRIPT_DIGITSUBSTITUTE
Dim psds As SCRIPT_DIGITSUBSTITUTE

    If ScriptRecordDigitSubstitution(Locale, psds) = S_OK Then
        ScriptRecordDigitSubstitutionC = psds
    End If
End Function

'---------------------
'   IchNext/IchPrev
'
'   Takes a SCRIPT_STRING_ANALYSIS and a character position and
'   returns the next or previous character position or word position, taking
'   Uniscribe "clusters" into account
'---------------------
Public Function IchNext(ssa As Long, ByVal ich As Long, fWord As Boolean) As Long
Dim cch As Long
Dim rgsla() As SCRIPT_LOGATTR_VB
    cch = ScriptString_pcOutCharsC(ssa)
    Call ScriptString_pLogAttrC(ssa, cch, rgsla())
    Do Until ich >= cch - 1
        ich = ich + 1
        If (rgsla(ich).fCharStop And Not fWord) Then Exit Do    ' We are at the end of a character
        If (rgsla(ich).fWordStop And fWord) Then Exit Do    ' We are at the end of a word
    Loop
    If ich > cch - 1 Then ich = cch ' Take care of the boundary cases
    IchNext = ich
End Function

Public Function IchPrev(ssa As Long, ByVal ich As Long, fWord As Boolean) As Long
Dim cch As Long
Dim rgsla() As SCRIPT_LOGATTR_VB
    If ich > 0 Then ' Make sure we are not already at the beginning of the string
        cch = ScriptString_pcOutCharsC(ssa)
        Call ScriptString_pLogAttrC(ssa, cch, rgsla())
        Do Until ich <= 0
            If (rgsla(ich).fCharStop And Not fWord) Then Exit Do    ' We are at the end of a character
            If (rgsla(ich).fWordStop And fWord) Then Exit Do    ' We are at the end of a word
            ich = ich - 1
        Loop
    End If
    If ich < 0 Then ich = 0 ' Take care of the boundary cases
    IchPrev = ich
End Function

'---------------------
'   IchBreakSpot
'
'   Find the appropriate place to break for this line. Here
'   is the algorithm used:
'
'   1) If all text will fit or no line breaking is specified, then output the whole string
'   2) If #1 is not true, find the first hard break within the text that could fit on the line
'   3) If #2 could not be found, then look for the last softbreak or whitespace within the text that could fit on the line.
'   4) If #3 is a whitespace, then break AFTER the character
'   5) If #3 is a soft break, then break BEFORE the character
'---------------------
Public Function IchBreakSpot(st As String, rgsla() As SCRIPT_LOGATTR_VB, cch As Long, Optional fNoLineBreaks As Boolean = False) As Long
Dim ich As Long

    ' First check for a hard break
    ich = InStr(1, st, vbCrLf, vbBinaryCompare) - 1
    If ich >= 0 And ich <= cch - 1 Then
        ' Use the hard break that was found
        IchBreakSpot = ich
    ElseIf Len(st) > cch Then
        ' There are more characters than there is space to output, on this line
        ' at least. So walk the string backwards, looking for a break character.
        For ich = cch - 1 To 0 Step -1
            With rgsla(ich)
                ' Check to see if it's a soft break char or a white space char
                If .fWhiteSpace Or .fSoftBreak Then
                    If .fWhiteSpace Then
                        ' White space means break AFTER this character
                        IchBreakSpot = ich
                    ElseIf ich > 0 Then
                        ' It's a soft break. If we have the characters to spare,
                        ' subtract one because we should be breaking BEFORE
                        ' the character, not AFTER.
                        IchBreakSpot = ich - 1
                    Else
                        ' There are not enough chars to go after. This probably should
                        ' never happen, but we may as well make sure.
                        IchBreakSpot = 0
                    End If
                    Exit For
                End If
            End With
        Next ich
    End If

    ' Assume cch is where it's at if it has never been set
    If IchBreakSpot = 0 Then IchBreakSpot = cch
End Function

'---------------------
'   UniscribeExtTextOutW
'
'   The Uniscribe-aware version of ExtTextOutW
'---------------------
Public Function UniscribeExtTextOutW(hdc As Long, wOptions As ETOFlags, lpRect As RECT, ByVal st As String, Optional x1 As Long = 0, Optional x2 As Long = 0) As Long
On Error Resume Next
Dim ssa As Long
Dim xWidth As Long
Dim cch As Long
Dim ichBreak As Long
Dim siz As Size
Dim rgsla() As SCRIPT_LOGATTR_VB
Dim rct As RECT

    ' Deep copy the rect since we may be modifying it
    rct.Left = lpRect.Left
    rct.Right = lpRect.Right
    rct.Top = lpRect.Top
    rct.Bottom = lpRect.Bottom

    xWidth = rct.Right - rct.Left

    ' Keep going till all of the string is done
    Do Until Len(st) = 0
        ssa = ScriptStringAnalyseC(hdc, st, Len(st), SSA_GLYPHS Or SSA_FALLBACK Or SSA_CLIP Or SSA_BREAK, xWidth)
        If ssa <> 0 Then
            cch = ScriptString_pcOutCharsC(ssa)
            Call ScriptString_pLogAttrC(ssa, cch, rgsla())

            ' Get the appropriate break point for this line (see comments in
            ' IchBreakSpot for a better understanding of "appropriate")
            ' CONSIDER: MULTILINE: To support multiple lines, the fNoLineBreaks flag
            ' below would have to be set to False. The rest of the function depends on it!
            ichBreak = IchBreakSpot(st, rgsla(), cch, True)

            ' Free up the analysis, we need to do it again with the new break info
            Call ScriptStringFreeC(ssa)

            ' reanalyze the string
            ssa = ScriptStringAnalyseC(hdc, st, ichBreak, SSA_GLYPHS Or SSA_FALLBACK Or SSA_CLIP Or SSA_BREAK, xWidth)
            If ssa <> 0 Then
                siz = ScriptString_pSizeC(ssa)
                cch = ScriptString_pcOutCharsC(ssa)

                ' Output the string, now that we have done all the preparation
                Call ScriptStringOut(ssa, rct.Left, rct.Top, wOptions, rct, x1, x2, BOOL_FALSE)

                ' Remove the portion of the string that has been output and adjust the rect
                ' for the next line
                st = Mid$(st, cch + 1)
                rct.Top = rct.Top + siz.cy
            Else
                ' The second analysis failed, so stop rather than loop forever on the same text
                Exit Do
            End If
            ' Free up the analysis, we need to (so we can do the next one)!
            Call ScriptStringFreeC(ssa)
        Else
            ' The analysis failed, so stop rather than loop forever on the same text
            Exit Do
        End If
    Loop
End Function
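
For anyone wondering what a call to the wrapper above might look like, here is a minimal sketch rather than anything taken from the Light Edit sample. It assumes the code lives in a project where the RECT and ETOFlags definitions the listing relies on are available, and that the form's ScaleMode is vbPixels. SampleText and the Form_Paint handler are hypothetical names introduced only for illustration.

```vb
' Hypothetical helper (not in the original sample): builds a mixed-script test string.
' ChrW$ is used so the source file itself can stay ASCII.
Private Function SampleText() As String
    ' Hebrew shin, lamed, vav, final mem, followed by Latin text
    SampleText = ChrW$(&H5E9) & ChrW$(&H5DC) & ChrW$(&H5D5) & ChrW$(&H5DD) & " Uniscribe"
End Function

' Hypothetical caller in a form: draws the string inside an 8-pixel margin.
' Assumes Me.ScaleMode = vbPixels and that RECT/ETOFlags are defined as the listing expects.
Private Sub Form_Paint()
    Dim rct As RECT

    rct.Left = 8
    rct.Top = 8
    rct.Right = Me.ScaleWidth - 8
    rct.Bottom = Me.ScaleHeight - 8

    ' 0 means no ExtTextOut-style options; x1 and x2 are left at their defaults of 0
    Call UniscribeExtTextOutW(Me.hdc, 0, rct, SampleText())
End Sub
```

Because st is passed ByVal, the wrapper can chop pieces off its own copy of the string as it outputs each line without touching the caller's text.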

'-----------------------
' LeftShift
'
'   Since VB does not have a left shift operator
'   LeftShift(8,2) is equivalent to 8 << 2
'-----------------------
Public Function LeftShift(ByVal lNum As Long, ByVal lShift As Long) As Long
    LeftShift = lNum * (2 ^ lShift)
End Function

'-----------------------
' RightShift
'
'   Since VB does not have a right shift operator
'   RightShift(8,2) is equivalent to 8 >> 2
'-----------------------
Public Function RightShift(ByVal lNum As Long, ByVal lShift As Long) As Long
    RightShift = lNum \ (2 ^ lShift)
End Function
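
To make the bitfield handling a little more concrete, here is a small round trip through the two shift helpers, using the SCRIPT_LOGATTR bit layout that ScriptString_pLogAttrC unpacks. It is only a sketch: DemoLogAttrBits is a hypothetical procedure that is not part of the original sample, and it assumes it sits in the same module as the listing.

```vb
' Hypothetical demo (not in the original sample): pack three flags into a byte,
' then unpack them the same way ScriptString_pLogAttrC does.
Public Sub DemoLogAttrBits()
    Dim byt As Byte
    Dim sla As SCRIPT_LOGATTR_VB

    ' Pack: fWhiteSpace is bit 1, fCharStop is bit 2, fWordStop is bit 3
    byt = LeftShift(1, 1) Or LeftShift(1, 2) Or LeftShift(1, 3)

    ' Unpack: mask out each bit, then shift it down to position 0
    sla.fSoftBreak = RightShift(byt And &H1, 0)
    sla.fWhiteSpace = RightShift(byt And &H2, 1)
    sla.fCharStop = RightShift(byt And &H4, 2)
    sla.fWordStop = RightShift(byt And &H8, 3)

    Debug.Assert byt = &HE
    Debug.Assert sla.fSoftBreak = 0
    Debug.Assert sla.fWhiteSpace = 1 And sla.fCharStop = 1 And sla.fWordStop = 1
End Sub
```

Both helpers rely on multiplication and integer division by a power of two, so they are only appropriate for small non-negative values like the ones used here.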

posted Monday, June 12, 2006 3:15 PM by michkap | 0 Comments
Filed Under: Unicode/standards, Int'l Programming, Fonts/Typography

Online persona and navel gazing

Years ago, I had a friend who later admitted they first 'met' me in the virtual world (at that point it was CompuServe forums), and their initial interest was to meet the person behind the online persona I had -- to see if I was actually that person. She and I actually ended up dating for a while after we met in real life, so I guess it was mostly me. :-)

I often describe my online personality or persona as being "exactly like me, only more so".

By and large I think that has been pretty consistently true, though blog posts have added a new wrinkle to the question, since in individual posts I am showing different facets of me. I look at some of the posts I have written and realize in retrospect that I have actually been experimenting with different personas in some of them. This can actually work against me since it does lack a bit of the consistency that many readers would probably prefer.

And I know that certain kinds of posts get more comments than others, and certain kinds get more page hits based on either links from others or some type of (for lack of a better word) 'searchability'. But I have been resistant to trying to change what I post based on those types of metrics -- I'd rather have 10 readers who were going to enjoy the whole blog than a fragmented list of subscribers who ignore most of it. I have never gotten anything but grief from having higher visibility.... :-)

I am actually reminded of a shtick from Bob 'Bobcat' Goldthwait from seeing him live (I think it also showed up on his album, too):

Guy who walked up to Bobcat: Bobcat! Hey, I used to like you! You used to be funny!
Bobcat: Yeah, well, I just met you, and you suck.

It is something I have actually experienced a few times over the last 18 months, though I have not had the nerve to pick up his line. Maybe I should; I mean, you may have really not liked a particular post, but I can say the same. And (out of the 1128 posts I've done to date including this one) I can only think of two that I can sincerely say I wish I had not posted, and both of those were due to the fact that they have been so badly misunderstood (with negative consequences for me).

The other problem I have had is people commenting (or even mildly complaining!) about the fact that they like certain kinds of posts that are in a minority here (like stuff about music I have been listening to or seeing live, or more personal stuff). Now the balance (or lack thereof) for these issues mirrors my actual life to some extent, which means that I am (unfortunately) not nearly as interesting as some of my experiences would indicate -- the ratio of backstage conversations with the likes of Kathleen Edwards to geeky code inspirations/internationalization issues that I find interesting is not nearly impressive enough to make me seem cool....

I am embarrassed to admit that I have been ducking John Stewart (not the Daily Show guy, the Infinitely Blue guy!) since he is wondering in email if he will be seeing me on Aimee Mann's latest tour and I am sad to say that none of the dates really seem to line up this time. Though the San Francisco gig next weekend is close enough to be tempting, and I do have the frequent flyer miles to make that show the cost of a BART ticket (at gimp rates, no less!)....

(By the way, I just noticed that Aimee's new site no longer seems to directly link to her old site that linked to me, and I am really embarrassed to admit how disappointed I was about that, even though I originally was embarrassed that the link was there at all, when it was her main site!)

I've had two cats pass away since I started the blog, and one was very public here while the other I told no one about, and I can't honestly say that either of the two approaches made the experience any easier than the other.

Unfortunately for some people, posts like this one will still likely be very rare. Even though they probably generate more email than any other kind, they are much harder to write and it is hard (for me) to be hosted on MSDN Blogs and have too many of these wandering navel gazer posts showing up...

posted Sunday, June 11, 2006 11:51 PM by michkap | 1 Comments
Filed Under: Potpourri

Death of a Data Access Page Wizard

After Clint posted Death of an old friend -- Data Access Pages, I do not feel quite as guilty that I have not really done anything with the TSI Form/Report to Data Access Page Wizard since Access 2002 (I have been told it mostly works in Access 2003 but it has not been tested there so the tester's axioms and the meaning of the word unsupported have relevance).

I guess you can now say with the technology no longer functioning in new versions of the product that it is double secret unsupported!

TRUE STORY: A few years back the Access team offered to buy the wizard and I offered to sell it at cost (I paid someone else to do all but the internationalizing of it), but the deal never really worked out for reasons that probably would make a fascinating blog post if I ever left Microsoft; for now you'll just have to use your imagination. :-)

So farewell to the wizard, which managed to double my old site's traffic to half a million hits a month after the international Office Update sites chose to link to this little wizard that was localized into more languages than Access itself is....

posted Sunday, June 11, 2006 11:58 AM by michkap | 0 Comments
Filed Under: Locales/Cultures, Potpourri

Why the Windows Shell can't provide the ultimate font solution for everyone (or even anyone!)

A few days ago when I put up the post Is this the Über-font post? No, but it is the teaser for it!, I did sort of promise that I'd deliver on a post explaining about the issues related to fonts on Windows that plague so many application developers, whether they are working on the Windows team, inside of Microsoft, or in the wider world....

Well, today is the first part of that!

To start with, I'll say that DEFAULT_GUI_FONT, MS Shell Dlg, and MS Shell Dlg 2 have one elemental goal in common -- at various times in the life of Windows they were implemented to help give a consistent look and feel to Windows applications running in the Windows Shell.

Now the Windows Shell has some pretty specific, historical characteristics that are relevant here -- like the fact that pretty much all of the UI in Windows (prior to LIPs at least!) would be in a single language. And with a single language it kind of makes sense to be trying to use a single font, right? This whole functionality can be thought of as being user interface language based.

Okay, so this functionality is potentially very useful for stuff in the Windows Shell, from control panel applets to the Start menu and so on. But applications like Media Player, MS Office applications, Visual Studio, or other apps inside of Microsoft tend to follow their own look and feel. Which means that in many cases none of those "Shell" fonts are the best choice for these other applications (especially if the list of UI languages they support is different, since it means there is a real possibility that the languages being covered may not be compatible).

What is worse, everything outside of Microsoft has the same issues, but even more so -- because with the possible exception of Shell add-ins, they are even less likely to have a need or even a desire to have a completely consistent language, look and feel with Windows.

The other problem that came up over time is that no one wanted to modify these fonts or their definitions over time, and between versions. Because there would be subtle differences in the metrics when comparing Microsoft Sans Serif to Tahoma to Segoe UI to MS Sans Serif and so on. And that means that if you change the definition of the underlying font technology used by the Shell, dialogs might start clipping text or looking incorrect.

Which sort of explains why there is no brand new MS Shell Dlg 3 that would support the new Segoe UI. Because after multiple versions of realizing how bad it would be, it has been, and it is to change the definitions of these built-in special fonts, and since the hit to developers of changing to use a different special font for each platform version is roughly equivalent to them having to change to a normal font for each platform version, there really is very little point to having a new MS Shell Dlg # in each new version of Windows that is unknown to prior versions.

Especially when applications whose font changes yet no UI review is done may look really awful in the untested font.

In practice, this affects not only the external customer applications and not only the non-Windows apps in Microsoft, but even applications in Windows do not tend to update their fonts universally to the new version. Which is the main reason that Chris Pirillo can find so much inconsistent font usage in posts of his like Windows Vista Feedback. I know some may consider it nitpicky, but I think Chris has provided the ultimate proof that the current scheme does not even handle the limited scenario of the consistent UI in the Windows Shell all that effectively. So how could it ever hope to help everyone else?

Now while the Shell folks were working on providing a way for Windows Shell applications to get a consistent look and feel (which as the above indicates has many flaws, by the way!), the folks in Microsoft Typography and on the GDI/Uniscribe text services teams were working hard to solve a very different problem -- how to make sure that whatever font was chosen for controls, that text would be displayable.

Such a functionality could not be tied to only the UI language; it would need to be able to support languages/scripts outside the narrow scope of the UI language if Windows language support was going to be truly international. Though of course there are problems with these solutions as well, which will be the subject of a future post....

This post brought to you by ༃ (U+0f03, a.k.a. TIBETAN MARK GTER YIG MGO -UM TSHEG MA)

posted Sunday, June 11, 2006 2:45 AM by michkap | 6 Comments
Filed Under: Locales/Cultures, Int'l Programming, Fonts/Typography

Cut off from society?

(Nothing technical in this post; it's a private service announcement from Trigeminal Software!)

Sometime in the early morning this last Tuesday, the Internet Service Provider that hosts my virtual domains upgraded their spam filter. This is good as far as it goes, and they have done it many times before. Unfortunately, this time they made some mistake that caused every mail that was sent to me to be treated as spam and quarantined.

Now I own two virtual domains, and one of them has no spam filtering (since I have never given anyone the email address, I have a client side rule that simply deletes all the spam). Unfortunately this caused me to not really notice that my actual email account was down.

Then by Friday I was calling them and saying I seemed to be having problems; unfortunately they had been having some other problems with their DNS registrations which they had just fixed so they sort of assumed it was the same problem (and that I was exaggerating how long I had been seeing a problem!). But they assured me they were up and running now.

So it was not until this morning that I was calling them again to say that even in tests I was doing from other email accounts, nothing was getting through.

The person I talked to immediately turned off the spam filter on the account, but he had no administrative privileges to the spam filter, so I do not know whether the quarantined messages are gone forever or whether I will just have 3000+ messages dumped on me on Monday morning.

I may not be home and dry just yet, but perhaps I could be seen as being home and vigorously toweling myself off....

Anyway, if you have sent me email to my non-Microsoft account and have either received no response or a response like this:

Your message has been delayed and is still awaiting delivery to the following recipient(s)

<my email address>

(Was addressed to <my email address>) Message delayed

Could not resolve mail server name because DNS server did not respond in time.

Then you can (if you like) send me another message, or if it was not very important then you can wait until I maybe am buried under an unbelievable mound of it on Monday morning.

Though of course you will be competing with all of the offers to lower my monthly mortgage (I do not currently have a mortgage) or for Viagra or for a degree from a non-certified university for underwater basket weaving or whatever.

No worries, I am sure I will get to it all eventually. Your patience and understanding is appreciated.... :-)

posted Saturday, June 10, 2006 11:14 PM by michkap | 0 Comments
Filed Under: Potpourri

Is the SendKeys juice worth the squeeze?

There was a funny scene in Superman II where after a ton of oranges Lois Lane has just a tiny bit of actual Orange Juice. And that was even with someone like Superman wielding the juicer. The truth is that the juice is not always worth the squeeze....

So, looking to SendKeys, the SendKeys statement has been around in Visual Basic for a long time.

It is at its core an attempt to get around all of the confusion of the difference between WM_KEYDOWN and WM_CHAR. It accomplishes this by shoving them all in one string that the caller passes in and then parsing out what it needs to.
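To make the "one string, parsed apart" idea a bit more concrete, here is a minimal sketch in Python of how such a string could be split into plain characters (the WM_CHAR-like input) and named keys (the WM_KEYDOWN-like input). It is only an illustration of the documented SendKeys syntax (+ for SHIFT, ^ for CTRL, % for ALT, ~ for ENTER, braces for named keys); the function name `tokenize_sendkeys` and the tuple layout are my own assumptions, not how VB actually implements it, and repeat counts like {LEFT 4} and parenthesized groups are not handled.

```python
# Hypothetical tokenizer for a SendKeys-style string (illustrative only;
# not VB's real parser).
MODIFIERS = {"+": "SHIFT", "^": "CTRL", "%": "ALT"}


def tokenize_sendkeys(text):
    """Split a SendKeys-style string into (kind, value, modifiers) tuples.

    kind is "char" for ordinary characters and "key" for named keys
    (anything longer than one character in braces, or "~" for ENTER).
    """
    tokens = []
    pending = []  # modifiers waiting to be attached to the next token
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in MODIFIERS:
            pending.append(MODIFIERS[ch])
            i += 1
            continue
        if ch == "~":
            tokens.append(("key", "ENTER", tuple(pending)))
            i += 1
        elif ch == "{":
            # Search from i + 2 so that "{}}" is read as a literal "}".
            end = text.find("}", i + 2)
            if end == -1:
                raise ValueError("unclosed brace in SendKeys string")
            name = text[i + 1:end]
            if len(name) == 1:
                tokens.append(("char", name, tuple(pending)))
            else:
                tokens.append(("key", name.upper(), tuple(pending)))
            i = end + 1
        else:
            tokens.append(("char", ch, tuple(pending)))
            i += 1
        pending = []
    return tokens


print(tokenize_sendkeys("%{F4}"))
print(tokenize_sendkeys("hi~"))
```

Even this toy version has to special-case braces, modifiers, and escapes, which gives some sense of why the "simplified" string format still confused people.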

Most people still found it to be fairly confusing, for what it is worth. Some of the most complex problems come out of such attempts at simplification. :-)

While people may primitively try to emulate this behavior by passing their own WM_CHAR (or even WM_KEYDOWN for the more sophisticated among the developer crowd), VB has long accomplished its task in a much more complicated way: via a SetWindowsHookEx function call to create a WH_JOURNALPLAYBACK hook. The journal stuff is described simply:

WH_JOURNALPLAYBACK

Installs a hook procedure that posts messages previously recorded by a WH_JOURNALRECORD hook procedure. For more information, see the JournalPlaybackProc hook procedure.

WH_JOURNALRECORD

Installs a hook procedure that records input messages posted to the system message queue. This hook is useful for recording macros. For more information, see the JournalRecordProc hook procedure.

.NET and its SendKeys class carry on that bold, low level tradition of using the journaling support.

You can dig in further to understand more about it if you like, although to be honest it may not be worth the trouble. As many developers have found, the current incarnation in both VB and in .NET does really strange things in beta builds of IE7 (character duplication, etc.) and actual exceptions in beta builds of Vista due to some of the security changes in the OS.

Of course, most developers only care about security when it affects them; otherwise most of them run as Administrators on their boxes and many of them are finding Vista to be a new experience since even administrators have to approve some of the changes they run (and developers get no free pass to run their code with impunity).

SendKeys was in its own way doing the same sort of thing -- moving to some of the lowest levels to do its work. And because of that, it is now having problems.

Of course people are looking at the problem now and working out the best way to solve the problem, though with a security decision to not allow less privileged windows to talk to windows with higher privileges, it is hard to imagine supporting the random VB application that assumes it can talk to any window it pleases.

So there will have to be some changes for developers, no matter what problems are solved on the .NET or OS side.

I would highly recommend staying away from SendKeys at this point.

Now looking back, every step that SendKeys went through to get where it is was done for sensible reasons. But now, having reached this point it is reasonable to wonder whether the juice was, in fact, worth the squeeze.

In my opinion, it wasn't....

This post brought to you by 𐂓 (U+10093, a.k.a. LINEAR B MONOGRAM B127 KAPO)

posted Saturday, June 10, 2006 5:35 AM by michkap | 0 Comments
Filed Under: Keyboards, Int'l Programming

DEP is not affected by locale settings

When I hear someone has a problem involving VB5/VB6 and strings that involves language issues, I can usually make a good guess at what is going wrong....

So the other day, when Christopher used the contact link to ask me a question, I had a notion of what might be happening:

Hello,

I've got an interesting regional settings situation that I thought you might have some knowledge of (there is a question somewhere in here...). I've got an app that makes use of some fancy VB6 code to dynamically execute some assembly code stored in a string (for CPU info stuff). It had worked fine until XP SP2 and all the various DEP issues. I've now fixed them. Or so I thought.

It seems that the DEP problem is fixed on English setups, but if I switch to Thai or Hebrew (the two that I tested, could be others), I get a DEP warning! By switched, I mean change the setting in both Standards and Formats and the non-Unicode Language field.

I don't run any code that is locale/language/etc specific. It should run the same on all scenarios. So my question is: What in the world goes in there that would cause DEP warnings on Thai/Hebrew, but not in English? Does VirtualAlloc/Protect/Free have any code internally that depends on the regional settings? It seems very odd. Any chance the fantastic Kaplan knows what sort of actions are taking place? :) If there is something weird here, there might be more than a few developers that would want to know about it...

Thanks,

-Christopher J. Thibeault

I don't think I'm fantastic or anything, but like I said I think I know what is going on.... :-)

Now the XPSP2 DEP (Data Execution Prevention) functionality does not itself have language-specific hooks in it. As a feature, it just keeps you from executing code that is actually sitting in a data section of a binary (a "feature" that is unique to x86 and which is used in some of the security exploits that have been seen in the past related to buffer overruns on the heap).

But any time you are using strings in VB <= 6.0 and you are not taking special care to avoid it you are going to see a lot of string conversion via the default system code page (which is directly affected by the default system locale, also known as the language for non-Unicode programs).

It boils down to the same problem behind the whole Double Secret Unicode thing I have talked about before. Basically, a VB string is being passed to a Win32 API, which VB will automatically convert from and to Unicode using the default system code page, and all of that conversion is basically either (a) useless or (b) destructive to the data.

In this case, where a string is being used in a Curland-esque running of assembler, any operation that is potentially destructive of the string is potentially very dangerous, and can certainly do things that were not intended.
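A rough way to see why the locale matters is to simulate the conversion outside of VB. The Python sketch below assumes (and this is only an assumption about how such code is typically written, not a description of Christopher's app) that the machine code was packed into a string one byte per character. The helper `through_ansi` and the three-byte sample are hypothetical, and `errors="replace"` merely stands in for the default character; a real Windows conversion may best-fit to some other character instead.

```python
def through_ansi(binary: bytes, code_page: str) -> bytes:
    """Simulate a Unicode -> ANSI conversion of a string holding raw bytes.

    Assumption: the data was packed one byte per character (code points
    0-255). Characters the code page cannot represent become "?" here.
    """
    as_string = binary.decode("latin-1")  # one character per byte
    return as_string.encode(code_page, errors="replace")


# Hypothetical three-byte x86 fragment: CPUID (0F A2) followed by RET (C3).
machine_code = bytes([0x0F, 0xA2, 0xC3])

for code_page in ("cp1252", "cp1255", "cp874"):  # Western, Hebrew, Thai
    result = through_ansi(machine_code, code_page)
    status = "intact" if result == machine_code else "damaged"
    print(code_page, result.hex(), status)
```

In this simulation the bytes survive the Western European code page but not the Hebrew or Thai ones, which is the same shape as the "works in English, fails in Thai/Hebrew" report.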

It was a good thing DEP was there to protect the computer, when you think about it!

Of course it is hard to answer more specifically what to do here to fix the problem than what the Double Secret Unicode post talks about, but keeping that principle in mind it is usually easy enough to see where Visual Basic might be doing a bit of that evil conversion on data that ought not to be converted....

This post brought to you by ଏ (U+0b0f, a.k.a. ORIYA LETTER E)

posted Thursday, June 08, 2006 12:01 AM by michkap | 5 Comments
Filed Under: Locales/Cultures, Unicode/standards, Int'l Programming

Reading the boilerplate

How often do we actually read the boilerplate text?

It is funny, I was thinking about this the other day.

I have belonged to Linguist List for a little while now, though quite definitely as a lurker (even if I can upgrade myself from delusions of linguistic aptitude to notions of it, I am still by no means a linguist, even ignoring the many years of education I am missing!).

All kinds of interesting mails go by though, and sometimes there are very interesting posts and announcements as well.

Though I couldn't help noticing their boilerplate that is prepended to any "job listing" that is sent:

The LINGUIST List strongly encourages employers to engage in non-discriminatory hiring practices. We urge employers not to discriminate on the grounds of race, ethnicity, nationality, age, religion, gender, or sexual orientation. However, we have no means of enforcing these standards.

(the last sentence, in red, is only in the email version, not in that site post I linked to -- putting it in red is just me though, not the email!)

In any case, I can't help feeling like there is an item missing there. One that theoretically impacts me.

Technically this is not such a huge deal for me, since I wouldn't be qualified for any of the positions they list anyway. But I can't help wondering if anyone who was qualified saw this and decided not to respond? It is easy to be an introvert in such cases....

Probably not, though. Like I said, I think I am oversensitive to it. I mean, just about every university I know of has such policies -- including Wayne State University and Eastern Michigan University, the sponsors of Linguist List. So it is probably covered even without the boilerplate disclaimer. :-)

Which actually brings up a bigger question -- how many people actually tune out once the paragraph about non-discrimination starts, since they know what it is going to cover? I mean, even in this one particular case I can't say I noticed an issue even after it appeared on hundreds of mail messages. Perhaps I would have been looking harder if I were hunting for a linguistic job, but it is more likely that I would be skipping past it to look at the actual jobs.

Do people actually ever read the boilerplate text?

posted Thursday, June 08, 2006 12:00 AM by michkap | 2 Comments
Filed Under: Linguistic, Multiple Sclerosis

Yi Syllables are totally Radical, dude!

(This is also not the font post; just hang in there, it will be here soon!)

Yi is one of the minority languages of China. The Liangshan Yi script was devised in the mid 70's and the standard was pushed out to the world in 1980 (a fuller description of the script in Unicode can be seen at this Babelstone article).

It is one of the scripts that is supported in Vista with a locale, an input method, and a font named Microsoft Yi Baiti. Which is very cool. :-)

So anyway, in Unicode the script has these two blocks:

Yi Syllables (U+a000 -- U+a48f)

Yi Radicals (U+a490 -- U+a4Cf)

The first block is the one that is actually used for the language; the second block really has no specific defined use outside of dictionary-type headers or index entries.

Because of this, the two are generally not collated together (with radicals interleaved with syllables) -- similar to the way Latin is not interleaved with Han in Simplified Chinese sorts based on Pinyin and Bopomofo is not interleaved with Han in Traditional Chinese pronunciation sorts used in Taiwan.

Of course there is still room for confusion, if you look across all of both ranges there are a few that look the same (on the left is the Yi Radical, on the right is the Yi Syllable):

U+a49c  ꒜   U+a0c0 ꃀ

U+a4a8  ꒨   U+a132 ꄲ

U+a49a  ꒚   U+a1d9 ꇙ

U+a4bf  ꒿   U+a259 ꉙ

U+a494  ꒔   U+a2cd ꋍ

U+a4c2  ꓂   U+a3b5 ꎵ

U+a4b0  ꒰   U+a3c2 ꏂ

U+a4a7  ꒧   U+a458 ꑘ

It is easy to imagine grabbing the wrong one (i.e. the radical rather than the syllable) if it is easy enough to do so.

Now this makes no difference for simply looking at text, but when trying to search within it or sort it, you could run across a real problem -- since in collation (e.g. in the Unicode Collation Algorithm) all of the radicals are put together in a separate weight space from where the syllables are.

Of course one could:

take these eight radicals and treat them differently than the other 47 of them, or
take those eight syllables and treat them differently than the other 1189, or
arbitrarily stick all 55 radicals in the middle of the syllables in ways not really defined, even though this would cause them to be unavoidably out of order in at least one case

but each of these solutions would come at the price of making some other behavior seem incorrect.

In the end, the key would be to just not use the Yi Radicals when one should be using the Yi Syllables (a solution probably best handled within the input method rather than within the font or the collation).
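As a small illustration of what "not using the radicals" might look like in practice, here is a hedged Python sketch that flags Yi Radicals in a string and, for the eight look-alikes listed in the table above, suggests the matching syllable. The block ranges and the eight pairs come from this post; the function names (`is_yi_radical`, `find_radicals`) are made up for the example, and this is not how any shipping input method works.

```python
# Look-alike pairs taken from the table above: radical -> syllable.
LOOKALIKES = {
    0xA49C: 0xA0C0, 0xA4A8: 0xA132, 0xA49A: 0xA1D9, 0xA4BF: 0xA259,
    0xA494: 0xA2CD, 0xA4C2: 0xA3B5, 0xA4B0: 0xA3C2, 0xA4A7: 0xA458,
}


def is_yi_syllable(ch):
    """True if ch falls in the Yi Syllables block (U+A000 - U+A48F)."""
    return 0xA000 <= ord(ch) <= 0xA48F


def is_yi_radical(ch):
    """True if ch falls in the Yi Radicals block (U+A490 - U+A4CF)."""
    return 0xA490 <= ord(ch) <= 0xA4CF


def find_radicals(text):
    """Return (index, radical, suggested syllable or None) for each radical."""
    hits = []
    for index, ch in enumerate(text):
        if is_yi_radical(ch):
            twin = LOOKALIKES.get(ord(ch))
            hits.append((index, ch, chr(twin) if twin else None))
    return hits


sample = "\ua015\ua49c"  # a syllable followed by a look-alike radical
for index, radical, suggestion in find_radicals(sample):
    print(index, f"U+{ord(radical):04X}", suggestion)
```

A check like this could run before searching or sorting, though catching the problem at input time is still the cleaner place to do it.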

This post brought to you by ꀕ (U+a015, a.k.a. YI SYLLABLE WU)
(star in an upcoming Unicode character story!)

posted Wednesday, June 07, 2006 10:14 AM by michkap | 4 Comments
Filed Under: Keyboards, Locales/Cultures, Linguistic, Unicode/standards, Int'l Programming, Fonts/Typography

Performance issues with language specific sorts?

(No, this is not the font post!)

Bryan Murtha asks:

I've read your blog and I was hoping to get out of it, just how do you setup an internationalized SQL Server database. I read all the docs and the International Software book from Microsoft. Ok, use NVarchar and XML, and your good on the storage. Nowhere on the web or MSDN or anywhere, does it cover just how to get across the inevitable performance issues of not being able to implement language specific sorts. Other then use Oracle is there an answer to this problem?

Unfortunately, I have to reject the premise of the question, in particular the piece that talks about the performance issues with language specific sorts, given the info in Handling multilingual data in SQL Server and in particular the post that links to (Making SQL Server index usage a bit more deterministic). There is definitely a way to make sure that the search is indexed for any desired language sort.... :-)

Now note that you will want to stay away from SQL compatibility sorts, as I pointed out yesterday.

This post brought to you by হ (U+09b9, a.k.a. BENGALI LETTER HA)

posted Wednesday, June 07, 2006 12:01 AM by michkap | 2 Comments
Filed Under: Collation/Casing, Unicode/standards, Int'l Programming

Is this the Über-font post? No, but it is the teaser for it!

What do this question from flyingxu:

I'm an MFC programmer. When I try to write some app for Chinese or Korean, I find many controls' size are changed by theire font. I mean, when I change the font, child dialog's size change and the whole dialog's layout is changed, which may lead to some overlap or empty space gap. It's big headache for me now.

and this question from Steffen (a.k.a. The SZ):

In your "What about logical fonts? " post you are writing much about shell dlg and shell dlg 2 and stuff. There are also a lot of other stuff outthere, but i still don't know what I should do. My requirements are pretty simple:
* windows 95 to vista support in single binary
* best ui experience which is available on this platform
* different dpi setting support
Thats all.

Currently I'm using FindResourceEx and mofiy the font name to "MS Shell Dlg 2" if available, if not using "MS Shell Dlg".

Why supports "MS Shell Dlg" not different dpi settings? (120dpi) (The font is not getting larger)

Will vista finally map "MS Shell Dlg 2" to "Segoe UI"? Or do I have to do this?

Steffen

and this question from Matt:

On my dev box I’m running XP and the default font for my winforms is “Microsoft Sans Serif, 8.25pt” and it appears to match the font of other apps such as Excel and Outlook. When my app is installed on a box running Vista, the font no longer matches Vista’s default font of Segoe.

What are the recommendations for an app to match the system font defaults? Any links to code samples?

and the following blog posts:

and what Mark mentioned:

The problem is not in coming up with the function but in understanding what the rules are across themes and across different language versions of Vista. Does the font change for different themes? Does the font change across language versions?

This, in my experience, has always been the problem, trying to understand what the right thing to do is based on MSDN is virtually impossible. There is a stack of conflicting and (sometimes plain bad) information spread across many different locations. It would be refreshing to see guidance that states clearly what font should be used where.

have in common?

Well, first of all they all talk about fonts, obviously. :-)

What is more important is that there is in the heart of each post, each question, and each comment a common desire.

Simply stated? It is the desire for some magical way to simply not have to worry about choosing a font or its size to get the appropriate display of the text in an application.

So, watch this space because in approximately 24 hours I will post my ultimate response to all of this. You may disagree with it, you may wonder why someone who would post something like this would think he was even qualified to try, you may think it is the coolest thing since sliced bread.

But you'd probably be wrong, it is just gonna be about fonts. :-)

This post brought to you by ʬ (U+02ac, a.k.a. LATIN LETTER BILABIAL PERCUSSIVE)

posted Tuesday, June 06, 2006 2:01 AM by michkap | 6 Comments
Filed Under: Locales/Cultures, Int'l Programming, Fonts/Typography

Unicode and SQL Collations have nothing to do with each other

It was just a few days ago when I was channeling comedies of my youth in the post Je, for sure, from Sweden. And in that post, I responded to Alexey Sadomov's query about unusual behavior in the SQL_SwedishStd_Pref_CP1_CI_AS SQL Server compatibility collation where Unicode columns and non-Unicode columns returned different results, and I did have two theories about the behavior:

Uh oh, what happened? Why does sorting by the non-Unicode column break the Swedish/Finnish sort behavior for the SQL compatibility collation?

Or perhaps it never worked properly in prior versions and the bug is that the newer support of Unicode is where the incompatibility is coming from?

Between the two theories, the second seems a bit more likely to me....

Well, I suppose we could file this one under the heading of truth being stranger than my crackpot theories, because the answer is simple:

For Unicode, we always use NLS code, including SQL collations. So we do get inconsistent sorting results for varchar vs. nvarchar in SQL collations. That’s why we recommend users to use Windows collation for consistency. Though the performance difference prevents us from always recommending it.

The sort table was inherited long time ago from Sybase. I have no idea why particular characters are sorted in a particular way. (Read: nobody here knows)

If nothing else, I now have a good question to ask Ken Whistler of Sybase about. :-)

In the end, it is simple -- if you are storing your data as Unicode, then there really is no good reason to be using one of these old collations. Not even for reasons of compatibility since the sort will not be compatible with the old sort anyway.

So take those way too retro SQL compatibility collations and dump 'em, today. They really are losing any semblance of useful purpose....

This post brought to you by ಌ (U+0c8c, a.k.a. KANNADA LETTER VOCALIC L)

posted Tuesday, June 06, 2006 12:01 AM by michkap | 1 Comments
Filed Under: Collation/Casing, Unicode/standards, Int'l Programming

Is SQL Server really supporting UTF-16?

DanS12345 asked via 'Suggest a topic' the following question:

The documentation for SQL Server 2005 says that xml is stored in UTF-16 encoding. In fact the doc's make a point that xml is stored differently than NCHARs, which use UCS-2. This does not seem to be the case. I have done some experiments that seem to indicate that xml with characters that are not in the BMP are not proprerly handled. SQL Server seems to handle xml the same way it handles regular CHARs and NCHARS. Can you clarify what is going on?

Thanks,
Dan

There are some xml tags in the following, I hope they make it through.

Here is an example using 𠀀 a unicode character from plane 2. I've tried this with many different other non-BMP characters and also loading from a file with the character code in it instead of literal text with an entity and the results are the same. This uses XQuery to measure the length of the string inside the A tag.

DECLARE @xml xml
SET @xml = '<A>𠀀</A>'
SELECT @xml.query('string-length(.)') AS RETURNS

--------
RETURNS
2

i.e. .query sees 𠀀 as 2 characters, not 1. This is how the SQL Server 2005 documents that NCHAR's, i.e. text stored as UCS-2, will behave in SQL Server.

Just to confirm the treatment of UTF outside of SQL Server I used some other XML tools to see what results they produced.

The XQuery that measures string-length when run through Saxon sees only 1 character, as expected

Likewise if <A>𠀀</A> is run through the XSLT program shown below run by Microsoft MSXML (4 or 6), or Saxon 8 also see only 1 character as expected.

<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/A">
<xsl:value-of select="string-length(.)"/>
</xsl:template>
</xsl:stylesheet>

However the same XSLT program run through .NET's System.Xml.Xslt (1.1 or 2.0) sees 2 characters, which seems incorrect.

Here is another XQuery to extract the 𠀀 character. This says that the &#X20000 character is not allowed.

DECLARE @xml xml
SET @xml = N'<A>𠀀</A>'
SELECT @xml.query('<A>{substring(.,1,1)}</A>')

--------
Msg 6309, Level 16, State 1, Line 3
XML well-formedness check: the data for node 'NoName' contains a character (0xD840) which is not allowed in XML.

Here is the same thing with a BMP character, which works as expected.

DECLARE @xml xml
SET @xml = N'<A>A</A>'
SELECT @xml.query('<A>{substring(.,1,1)}</A>')

----------

<A>A</A>

And one more example doing a comparison. This comparison returns false, which isn't true.

DECLARE @xml xml
SET @xml = N'<A>𠀀</A>'
SELECT @xml.query('substring(.,1,1) eq "𠀀"')

-------
false

I can sort of understand this last example; it implies it is treating the parameter of the .query function as an ordinary string rather than xml. But that would not be a correct way to handle this.

So what's going on here, is SQL Server really using UTF-16 for xml or is it just storing xml as a plain ol' NCHAR, or is there some other fine details about managing UTF that is not obvious.

There is a bit of confusion here -- though not on Dan's part. It is on the part of a large and complex project (SQL Server) that has some components with full support of UTF-16 and some with no support of UTF-16, and most of them in between those two extremes.

I'll try and separate some of this out now. :-)

A claim of UTF-16 support is not the same thing as a claim of NCR support of supplementary characters.

What it means is that U+20000, if stored in UTF-16 (which means underneath it will be U+d840 U+dc00), can be stored. There is no implied promise that it will parse the string to convert NCRs into characters just because they are being stored (in truth it will sometimes be parsed and sometimes not depending on the component).
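For anyone who wants to check the arithmetic behind that surrogate pair, here is a short Python sketch of the standard UTF-16 calculation. The function name `to_surrogates` is just for illustration.

```python
def to_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates(0x20000)
print(f"{high:04X} {low:04X}")  # D840 DC00, the pair named above

# Cross-check against Python's own UTF-16 encoder.
assert "\U00020000".encode("utf-16-be") == bytes([0xD8, 0x40, 0xDC, 0x00])
```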

The reason NCRs exist in both HTML and XML is to allow either the parsing or the storage of text outside of the encoding being used -- in this case to store UTF-32 inside of text that is not encoded in UTF-32.

On the other hand, look at the error message that came back in one of those cases:

Msg 6309, Level 16, State 1, Line 3
XML well-formedness check: the data for node 'NoName' contains a character (0xD840) which is not allowed in XML.

which of course makes it clear that in some cases it does indeed do the parsing of NCRs.

I believe the trouble here is that often XML in SQL Server is processed as UTF-8 rather than UTF-16, and it is illegal to use surrogate code units inside of UTF-8. However, since you put it in as UTF-16 text (albeit with a UTF-32 NCR) you managed to find a bug -- a point where it converted the NCR to UTF-16 but then converted it all to UTF-8 as is.
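The UTF-8 side of this can be sketched in a few lines of Python. A strict UTF-8 encoder refuses a lone surrogate, while the whole supplementary character encodes as four bytes. The "as is" conversion at the end uses Python's `surrogatepass` error handler purely to show what such ill-formed output would look like; it is my guess at the shape of the bug, not a trace of what SQL Server actually does.

```python
def encode_utf8_strict(text):
    """Encode as UTF-8, returning None if the text contains a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


whole = encode_utf8_strict("\U00020000")  # the full supplementary character
half = encode_utf8_strict("\ud840")       # just the high surrogate

print(whole.hex() if whole else "rejected")
print(half.hex() if half else "rejected")

# Converting the surrogate "as is" yields three bytes that are not
# well-formed UTF-8, which is what a well-formedness check would trip over.
as_is = "\ud840".encode("utf-8", errors="surrogatepass")
print(as_is.hex())
```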

I probably would not have made the claim in the first place that SQL Server 2000 or SQL Server 2005 supported much beyond UCS-2, with some minor support hooks there for UTF-16 that happen due to certain components being a bit further along in their evolution.

The length issue is kind of the same -- the length, in NCHARs, of a string containing a single supplementary character is expected to be 2, not 1. This is true whether SQL Server supports UTF-16 or not (posts like this one discuss the length issue in more detail).
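The two ways of counting are easy to compare in Python, where len() counts code points rather than UTF-16 code units. The utf16_length helper is a made-up name for this sketch; it only stands in for a length measured in 16-bit units, the way an NCHAR count would be.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to hold text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


s = "\U00020000"

print(len(s))           # counted in code points
print(utf16_length(s))  # counted in UTF-16 code units
```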

In the end, I would call SQL Server more "surrogate neutral" than "surrogate supporting". :-)

Though this, so many years after the surrogate mechanism was first added to Unicode in version 2.0, is bad enough for me to mark this post on the "Unicode Lame List". It is time to get this fixed, I think....

This post brought to you by ㄛ (U+311b, BOPOMOFO LETTER O)
(A character listed in Unihan as being derived from U+20000, and there is some slight resemblance!)

posted Monday, June 05, 2006 6:54 AM by michkap | 0 Comments
Filed Under: Collation/Casing, Unicode/standards, Int'l Programming, Unicode Lame List

More on case insensitivity and its intuitivality

Yesterday in 'Intuitivosity (intuitivality?) of case insensitivity' I talked a little bit about some of the limitations in the Mac OS X implementation of case insensitivity.

I realized I needed to actually point out a little more information here to paint the full picture....

On the surface this would suggest that Apple's OS only handles the casing of ASCII letters, but that would be misleading due to several related facts:

They support Unicode names (cf: Apple TN2078)
Those names are stored in Unicode Normalization form D (cf: UAX #15: Unicode Normalization Forms and Apple TN2078)
They claim to sometimes support case insensitivity, but not always (cf: File System Guidelines)

Now I do not have a Mac or anything, but assuming the first two points are 100% true and the third point is just a little misleading based on the stuff Geoff noted that I talked about yesterday (i.e. the case insensitivity is incomplete rather than supported fully by some file systems/not at all by others).....

#1 The OS X file system is potentially much cooler than NTFS on Windows since it does Unicode normalization.

#2 The OS X file system is potentially much lamer than NTFS on Windows if the only case insensitivity is ASCII A-Z ~ a-z.

#3 The OS X file system potentially mitigates Point #2 above for the Latin script since it fully decomposes everything and the "cased" portion of every Latin script letter is handled in a much smaller table than Windows has to cover the same area.

#4 NTFS on Windows has a slightly higher maintenance burden, since it has all of the characters in its casing tables and would potentially have to update those tables more often to handle new Unicode characters that are added.

#5 NTFS on Windows sees an effective mitigation in the fact that Unicode no longer adds precomposed characters that can be decomposed into already existing sequences that are canonically equivalent, meaning this advantage would not apply to future versions.

#6 If the OS X casing truly only handles ASCII A-Z then there are actually 3.6 buttloads of characters that it is not properly handling across other cased scripts from Cyrillic to Greek to Coptic to Armenian to Georgian. Note: since I honestly do not know if this is the case, I hesitate to say whether Apple is lame on this point or not -- they may be just fine. Does anyone know?

#7 NTFS on Windows is technically not case insensitive at its lowest levels, so in theory it has the same problem Geoff noted with some of the tools that are available on the Mac. The difference is that architecturally there is no way to get at this mode from within Win32, so Windows has a more complete wrapper that avoids the inconsistency that can be seen on Apple's platform....
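Points #2, #3, and #6 can be illustrated with a small Python sketch. It assumes the hypothetical worst case from #2 (folding only ASCII A-Z) and uses plain Unicode NFD from the standard unicodedata module. Both function names are invented here, and nothing in it is taken from Apple's actual implementation, which I have not seen.

```python
import unicodedata


def ascii_only_fold(name):
    """Lowercase ASCII A-Z and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def nfd_then_ascii_fold(name):
    """Decompose to Normalization Form D first, then fold ASCII A-Z only."""
    return ascii_only_fold(unicodedata.normalize("NFD", name))


def same_name(fold, a, b):
    return fold(a) == fold(b)


pairs = {
    "Latin E acute": ("\u00C9", "\u00E9"),  # decomposes to E/e + U+0301
    "Cyrillic DE": ("\u0414", "\u0434"),    # no decomposition, not ASCII
    "Latin schwa": ("\u018F", "\u0259"),    # no decomposition, not ASCII
}

for label, (upper, lower) in pairs.items():
    print(label,
          same_name(ascii_only_fold, upper, lower),
          same_name(nfd_then_ascii_fold, upper, lower),
          same_name(str.casefold, upper, lower))
```

Decomposition rescues the accented Latin letter because the cased part becomes a plain ASCII E, but it does nothing for letters that have no decomposition, which is where a fuller casing table (here str.casefold) is needed.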

Ok, those are all the useful conclusions I was able to think of at the moment. :-)

This post brought to you by Ə and ə (U+018f and U+0259, a.k.a. LATIN CAPITAL LETTER SCHWA and LATIN SMALL LETTER SCHWA)

posted Monday, June 05, 2006 12:01 AM by michkap | 10 Comments
Filed Under: Collation/Casing, Unicode/standards, Int'l Programming, Unicode Lame List
