GoogleSQL for Spanner supports collation. You can learn more about collation in this topic.
About collation
Collation determines how strings are sorted and compared in an ORDER BY
operation. If you would like to use custom collation in the operation, you can include the COLLATE
clause with a collation specification.
Where you can assign a collation specification
In the ORDER BY
clause, you can specify a collation specification for a collation-supported column. This overrides any collation specifications set previously.
For example:
SELECT Place
FROM Locations
ORDER BY Place COLLATE "und:ci"
Query statements
Type | Support |
---|---|
Sorting | ORDER BY clause |
Collation specification details
A collation specification determines how strings are sorted and compared in collation-supported operations. You can define a collation specification for collation-supported types. These types of collation specifications are available:
If a collation specification isn't defined, the default collation specification is used. To learn more, see the next section.
Default collation specification
When a collation specification isn't assigned or is empty, the ordering behavior is identical to 'unicode'
collation, which you can learn about in the Unicode collation specification.
Unicode collation specification
collation_specification:
'language_tag[:collation_attribute]'
A unicode collation specification indicates that the operation should use the Unicode Collation Algorithm to sort and compare strings. The collation specification can be a STRING
literal or a query parameter.
The language tag
The language tag determines how strings are generally sorted and compared. Allowed values for language_tag
are:
- A standard locale string: This name is usually two or three letters that represent the language, optionally followed by an underscore or dash and two letters that represent the region — for example,
en_US
. These names are defined by the Common Locale Data Repository (CLDR). und
: A locale string representing the undetermined locale.und
is a special language tag defined in the IANA language subtag registry and used to indicate an undetermined locale. This is also known as the root locale and can be considered the default Unicode collation. It defines a reasonable, locale agnostic collation. It differs significantly fromunicode
.unicode
: Returns data in Unicode code point order, which is identical to the ordering behavior whenCOLLATE
isn't used. The sort order will look largely arbitrary to human users.
The collation attribute
In addition to the language tag, the unicode collation specification can have an optional collation_attribute
, which enables additional rules for sorting and comparing strings. Allowed values are:
ci
: Collation is case-insensitive.cs
: Collation is case-sensitive. By default,collation_attribute
is implicitlycs
.
If you're using the unicode
language tag with a collation attribute, these caveats apply:
unicode:cs
is identical tounicode
.unicode:ci
is identical tound:ci
. It's recommended to migrateunicode:ci
tobinary
.
Collation specification example
This is what the ci
collation attribute looks like when used with the und
language tag in the ORDER BY
clause:
SELECT Place
FROM Locations
ORDER BY Place COLLATE 'und:ci'
Caveats
Differing strings can be considered equal. For instance,
ẞ
(LATIN CAPITAL LETTER SHARP S) is considered equal to'SS'
in some contexts. The following expressions both evaluate toTRUE
:COLLATE('ẞ', 'und:ci') > COLLATE('SS', 'und:ci')
COLLATE('ẞ1', 'und:ci') < COLLATE('SS2', 'und:ci')
This is similar to how case insensitivity works.
In search operations, strings with different lengths could be considered equal. To ensure consistency, collation should be used without search tailoring.
There are a wide range of unicode code points (punctuation, symbols, etc), that are treated as if they aren't there. So strings with and without them are sorted identically. For example, the format control code point
U+2060
is ignored when the following strings are sorted:SELECT * FROM UNNEST([ 'oran\u2060ge1', '\u2060orange2', 'orange3' ]) AS fruit ORDER BY fruit COLLATE 'und' /*---------* | fruit | +---------+ | orange1 | | orange2 | | orange3 | *---------*/
Ordering may change. The Unicode specification of the
und
collation can change occasionally, which can affect sorting order. If you need a stable sort order that's guaranteed to never change, useunicode
collation.