Locale specifies the user's language, country, and any special variant preferences, such as collation. Vertica uses locale to determine the behavior of certain string functions. Locale also determines the collation for various SQL commands that require ordering and comparison, such as aggregate GROUP BY and ORDER BY clauses, joins, and the analytic ORDER BY clause.
The default locale for a Vertica database is en_US@collation=binary (English US). You can define a new default locale that is used for all sessions on the database. You can also override the locale for individual sessions. However, projections are always collated using the default en_US@collation=binary collation, regardless of the session collation. Any locale-specific collation is applied at query time.
If you set the locale to null, Vertica sets the locale to en_US_POSIX. You can set the locale back to the default locale and collation by issuing the vsql meta-command \locale. For example:
Note
=> set locale to '';
INFO 2567: Canonical locale: 'en_US_POSIX'
Standard collation: 'LEN'
English (United States, Computer)
SET
=> \locale en_US@collation=binary;
INFO 2567: Canonical locale: 'en_US'
Standard collation: 'LEN_KBINARY'
English (United States)
=> \locale
en_US@collation-binary;
Vertica locale specifications follow a subset of the Unicode LDML standard as implemented by the ICU library.
1 - Locale handling in Vertica
The following sections describes how Vertica handles locale.
The following sections describes how Vertica handles locale.
Session locale
Locale is session-scoped and applies only to queries executed in that session. You cannot specify locale for individual queries. When you start a session it obtains its locale from the configuration parameter DefaultSessionLocale.
Query restrictions
The following restrictions apply when queries are run with locale other than the default en_US@collation=binary:
When one or more of the left-side NOT IN columns is CHAR or VARCHAR, multi-column NOT IN subqueries are not supported . For example:
=> CREATE TABLE test (x VARCHAR(10), y INT);
=> SELECT ... FROM test WHERE (x,y) NOT IN (SELECT ...);
ERROR: Multi-expression NOT IN subquery is not supported because a left
hand expression could be NULL
Note
Even if columns test.x and test.y have a NOT NULL constraint, an error occurs.
If the outer query contains a GROUP BY clause on a CHAR or VARCHAR column, correlated HAVING clause subqueries are not supported. In the following example, the GROUP BY x in the outer query causes the error:
=> DROP TABLE test CASCADE;
=> CREATE TABLE test (x VARCHAR(10));
=> SELECT COUNT(*) FROM test t GROUP BY x HAVING x
IN (SELECT x FROM test WHERE t.x||'a' = test.x||'a' );
ERROR: subquery uses ungrouped column "t.x" from outer query
Subqueries that use analytic functions in the HAVING clause are not supported. For example:
=> DROP TABLE test CASCADE;
=> CREATE TABLE test (x VARCHAR(10));
=> SELECT MAX(x)OVER(PARTITION BY 1 ORDER BY 1) FROM test
GROUP BY x HAVING x IN (SELECT MAX(x) FROM test);
ERROR: Analytics query with having clause expression that involves
aggregates and subquery is not supported
Collation and projections
Projection data is sorted according to the default en_US@collation=binary collation. Thus, regardless of the session setting, issuing the following command creates a projection sorted by col1 according to the binary collation:
=> CREATE PROJECTION p1 AS SELECT * FROM table1 ORDER BY col1;
In such cases, straße and strasse are not stored near each other on disk.
Sorting by binary collation also means that sort optimizations do not work in locales other than binary. Vertica returns the following warning if you create tables or projections in a non-binary locale:
WARNING: Projections are always created and persisted in the default
Vertica locale. The current locale is de_DE
Non-binary locale input handling
When the locale is non-binary, Vertica uses the
COLLATION function to transform input to a binary string that sorts in the proper order.
This transformation increases the number of bytes required for the input according to this formula:
The default value of configuration parameter CollationExpansion is 5.
Character data type handling
CHAR fields are displayed as fixed length, including any trailing spaces. When CHAR fields are processed internally, they are first stripped of trailing spaces. For VARCHAR fields, trailing spaces are usually treated as significant characters; however, trailing spaces are ignored when sorting or comparing either type of character string field using a non-binary locale.
The maximum length parameter for VARCHAR and CHAR data type refers to the number of octets (bytes) that can be stored in that field and not number of characters. When using multi-byte UTF-8 characters, the fields must be sized to accommodate from 1 to 4 bytes per character, depending on the data.
2 - Specifying locale: long form
Vertica supports long forms that specify the collation keyword.
Vertica supports long forms that specify the collation keyword. Vertica extends long-form processing to accept collation arguments.
Locale specification strings are case insensitive. For example, en_us and EN_US, are equivalent.
You can substitute underscores with hyphens. For example: [-script]
Parameters
language
A two- or three-letter lowercase code for a particular language. For example, Spanish is es English is en and French is fr. The two-letter language code uses the ISO-639 standard.
_script
An optional four-letter script code that follows the language code. If specified, it should be a valid script code as listed on the Unicode ISO 15924 Registry.
_country
A specific language convention within a generic language for a specific country or region. For example, French is spoken in many countries, but the currencies are different in each country. To allow for these differences among specific geographical, political, or cultural regions, locales are specified by two-letter, uppercase codes. For example, FR represents France and CA represents Canada. The two letter country code uses the ISO-3166 standard.
_variant
Differences may also appear in language conventions used within the same country. For example, the Euro currency is used in several European countries while the individual country's currency is still in circulation. To handle variations inside a language and country pair, add a third code, the variant code. The variant code is arbitrary and completely application-specific. ICU adds _EURO to its locale designations for locales that support the Euro currency. Variants can have any number of underscored key words. For example, EURO_WIN is a variant for the Euro currency on a Windows computer.
Another use of the variant code is to designate the Collation (sorting order) of a locale. For instance, the es__TRADITIONAL locale uses the traditional sorting order which is different from the default modern sorting of Spanish.
@collation-spec
Vertica only supports the keyword collation, as follows:
@collation=collation-type[;arg]...
Collation can specify one or more semicolon-delimited arguments, described below.
collation-type is set to one of the following values:
big5han: Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese).
dict: For a dictionary-style ordering (such as in Sinhala).
direct: Hindi variant.
gb2312/gb2312han: Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese).
phonebook: For a phonebook-style ordering (such as in German).
pinyin: Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin (used in Chinese).
reformed: Reformed collation (such as in Swedish).
standard: The default ordering for each language. For root it is [UCA] order; for each other locale it is the same as UCA (Unicode Collation Algorithm) ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they have effect only in certain locales.
stroke: Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese) not supported.
traditional: For a traditional-style ordering (such as in Spanish).
unihan: Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese) not supported.
Collations might default to root, the ICU default collation.
Invalid values of the collation keyword and its synonyms do not cause an error. For example, the following does not generate an error. It simply ignores the invalid value:
=> \locale en_GB@collation=xyz
INFO 2567: Canonical locale: 'en_GB@collation=xyz'
Standard collation: 'LEN'
English (United Kingdom, collation=xyz)
collation can specify one or more of the following arguments :
Parameter
Short form
Description
colstrength
S
Sets the default strength for comparison. This feature is locale dependent.
Set colstrength to one of the following:
1 | primary: Ignores case and accents. Only primary differences are used during comparison—for example, a versus z.
2 | secondary: Ignores case. Only secondary and above differences are considered for comparison—for example, different accented forms of the same base letter such as a versus \u00E4.
3 | tertiary (default): Only tertiary differences and higher are considered for comparison. Tertiary comparisons are typically used to evaluate case differences—for example, Z versus z.
4| quarternary: For example, used with Hiragana.
colAlternate
A
Sets alternate handling for variable weights, as described in UCA, one of the following:
non-ignorable | N | D
shifted | S
colBackwards
F
For Latin with accents, this parameter determines which accents are sorted. It sets the comparison for the second level to be backwards.
Note
colBackwards is automatically set for French accents.
Set
colBackwards to one of the following:
on | O: The normal UCA algorithm is used.
off | X: All strings that are in Fast C or D normalization form (FCD) sort correctly, but others do not necessarily sort correctly. Set to off if the strings to be compared are in FCD.
colNormalization
N
Set to one of the following:
on | O: The normal UCA algorithm is used.
off | X: All strings that are in Fast C or D normalization form (FCD) sort correctly, but others won't necessarily sort correctly. It should only be set off if the strings to be compared are in FCD.
colCaseLevel
E
Set to one of the following:
on | O: A level consisting only of case characteristics is inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on.
off | X: This level is omitted.
colCaseFirst
C
Set to one of the following:
upper | U: Upper case sorts before lower case.
lower | L: Lower case sorts before upper case. This is useful for locales that have already supported ordering but require different order of cases. It affects case and tertiary levels.
off | short: Tertiary weights unaffected
colHiraganaQuarternary
H
Controls special treatment of Hiragana code points on quaternary level, one of the following:
on | O: Hiragana codepoints get lower values than all the other non-variable code points. The strength must be greater or equal than quaternary for this attribute to take effect.
off | X: Hiragana letters are treated normally.
colNumeric
D
If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UCD]) is sorted at a primary level with its numeric value. For example, A-21 < A-123.
variableTop
B
Sets the default value for the variable top. All code points with primary weights less than or equal to the variable top will be considered variable, and are affected by the alternate handling.
For example, the following command sets variableTop to be HYPHEN (u2010)
Specify German locale as used in Germany (de), with phonebook-style collation:
=> \locale de_DE@collation=phonebook
INFO 2567: Canonical locale: 'de_DE@collation=phonebook'
Standard collation: 'KPHONEBOOK_LDE'
German (Germany, collation=Phonebook Sort Order)
Deutsch (Deutschland, Sortierung=Telefonbuch-Sortierregeln)
Specify German locale as used in Germany (de), with phonebook-style collation and strength set to secondary:
=> \locale de_DE@collation=phonebook;colStrength=secondary
INFO 2567: Canonical locale: 'de_DE@collation=phonebook'
Standard collation: 'KPHONEBOOK_LDE_S2'
German (Germany, collation=Phonebook Sort Order)
Deutsch (Deutschland, Sortierung=Telefonbuch-Sortierregeln)
3 - Specifying locale: short form
Vertica accepts locales in short form.
Vertica accepts locales in short form. You can use the short form to specify the locale and keyname pair/value names.
To determine the short form for a locale, type in the long form and view the last line of INFO, as follows:
\locale frINFO: Locale: 'fr'
INFO: French
INFO: français
INFO: Short form: 'LFR'
Examples
Specify en (English) locale:
\locale LENINFO: Locale: 'en'
INFO: English
INFO: Short form: 'LEN'
Specify German locale as used in Germany (de), with phonebook-style collation:
\locale LDE_KPHONEBOOKINFO: Locale: 'de@collation=phonebook'
INFO: German (collation=Phonebook Sort Order)
INFO: Deutsch (Sortierung=Telefonbuch-Sortierregeln)
INFO: Short form: 'KPHONEBOOK_LDE'
Specify German locale as used in Germany (de), with phonebook-style collation:
\locale LDE_KPHONEBOOK_S2INFO: Locale: 'de@collation=phonebook'
INFO: German (collation=Phonebook Sort Order)
INFO: Deutsch (Sortierung=Telefonbuch-Sortierregeln)
INFO: Short form: 'KPHONEBOOK_LDE_S2'
4 - Supported locales
The following are the supported locale strings for Vertica.
The following are the supported locale strings for Vertica. Each locale can optionally have a list of key/value pairs (see Specifying locale: long form).
Locale Name
Language or Variant
Region
af
Afrikaans
af_NA
Afrikaans
Namibian Afrikaans
af_ZA
Afrikaans
South Africa
am
Ethiopic
am_ET
Ethiopic
Ethiopia
ar
Arabic
ar_AE
Arabic
United Arab Emirates
ar_BH
Arabic
Bahrain
ar_DZ
Arabic
Algeria
ar_EG
Arabic
Egypt
ar_IQ
Arabic
Iraq
ar_JO
Arabic
Jordan
ar_KW
Arabic
Kuwait
ar_LB
Arabic
Lebanon
ar_LY
Arabic
Libya
ar_MA
Arabic
Morocco
ar_OM
Arabic
Oman
ar_QA
Arabic
Qatar
ar_SA
Arabic
Saudi Arabia
ar_SD
Arabic
Sudan
ar_SY
Arabic
Syria
ar_TN
Arabic
Tunisia
ar_YE
Arabic
Yemen
as
Assamese
as_IN
Assamese
India
az
Azerbaijani
az_Cyrl
Azerbaijani
Cyrillic
az_Cyrl_AZ
Azerbaijani
Azerbaijan Cyrillic
az_Latn
Azerbaijani
Latin
az_Latn_AZ
Azerbaijani
Azerbaijan Latin
be
Belarusian
be_BY
Belarusian
Belarus
bg
Bulgarian
bg_BG
Bulgarian
Bulgaria
bn
Bengali
bn_BD
Bengali
Bangladesh
bn_IN
Bengali
India
bo
Tibetan
bo_CN
Tibetan
PR China
bo_IN
Tibetan
India
ca
Catalan
ca_ES
Catalan
Spain
cs
Czech
cs_CZ
Czech
Czech Republic
cy
Welsh
cy_GB
Welsh
United Kingdom
da
Danish
da_DK
Danish
Denmark
de
German
de_AT
German
Austria
de_BE
German
Belgium
de_CH
German
Switzerland
de_DE
German
Germany
de_LI
German
Liechtenstein
de_LU
German
Luxembourg
el
Greek
el_CY
Greek
Cyprus
el_GR
Greek
Greece
en
English
en_AU
English
Australia
en_BE
English
Belgium
en_BW
English
Botswana
en_BZ
English
Belize
en_CA
English
Canada
en_GB
English
United Kingdom
en_HK
English
Hong Kong S.A.R. of China
en_IE
English
Ireland
en_IN
English
India
en_JM
English
Jamaica
en_MH
English
Marshall Islands
en_MT
English
Malta
en_NA
English
Namibia
en_NZ
English
New Zealand
en_PH
English
Philippines
en_PK
English
Pakistan
en_SG
English
Singapore
en_TT
English
Trinidad and Tobago
en_US
English
United States
en_US_POSIX
English
United States Posix
en_VI
English
U.S. Virgin Islands
en_ZA
English
Zimbabwe or South Africa
en_ZW
English
Zimbabwe
eo
Esperanto
es
Spanish
es_AR
Spanish
Argentina
es_BO
Spanish
Bolivia
es_CL
Spanish
Chile
es_CO
Spanish
Columbia
es_CR
Spanish
Costa Rica
es_DO
Spanish
Dominican Republic
es_EC
Spanish
Ecuador
es_ES
Spanish
Spain
es_GT
Spanish
Guatemala
es_HN
Spanish
Honduras
es_MX
Spanish
Mexico
es_NI
Spanish
Nicaragua
es_PA
Spanish
Panama
es_PE
Spanish
Peru
es_PR
Spanish
Puerto Rico
es_PY
Spanish
Paraguay
es_SV
Spanish
El Salvador
es_US
Spanish
United States
es_UY
Spanish
Uruguay
es_VE
Spanish
Venezuela
et
Estonian
et_EE
Estonian
Estonia
eu
Basque
Spain
eu_ES
Basque
Spain
fa
Persian
fa_AF
Persian
Afghanistan
fa_IR
Persian
Iran
fi
Finnish
fi_FI
Finnish
Finland
fo
Faroese
fo_FO
Faroese
Faroe Islands
fr
French
fr_BE
French
Belgium
fr_CA
French
Canada
fr_CH
French
Switzerland
fr_FR
French
France
fr_LU
French
Luxembourg
fr_MC
French
Monaco
fr_SN
French
Senegal
ga
Gaelic
ga_IE
Gaelic
Ireland
gl
Gallegan
gl_ES
Gallegan
Spain
gsw
German
gsw_CH
German
Switzerland
gu
Gujurati
gu_IN
Gujurati
India
gv
Manx
gv_GB
Manx
United Kingdom
ha
Hausa
ha_Latn
Hausa
Latin
ha_Latn_GH
Hausa
Ghana (Latin)
ha_Latn_NE
Hausa
Niger (Latin)
ha_Latn_NG
Hausa
Nigeria (Latin)
haw
Hawaiian
Hawaiian
haw_US
Hawaiian
United States
he
Hebrew
he_IL
Hebrew
Israel
hi
Hindi
hi_IN
Hindi
India
hr
Croation
hr_HR
Croation
Croatia
hu
Hungarian
hu_HU
Hungarian
Hungary
hy
Armenian
hy_AM
Armenian
Armenia
hy_AM_REVISED
Armenian
Revised Armenia
id
Indonesian
id_ID
Indonesian
Indonesia
ii
Sichuan
ii_CN
Sichuan
Yi
is
Icelandic
is_IS
Icelandic
Iceland
it
Italian
it_CH
Italian
Switzerland
it_IT
Italian
Italy
ja
Japanese
ja_JP
Japanese
Japan
ka
Georgian
ka_GE
Georgian
Georgia
kk
Kazakh
kk_Cyrl
Kazakh
Cyrillic
kk_Cyrl_KZ
Kazakh
Kazakhstan (Cyrillic)
kl
Kalaallisut
kl_GL
Kalaallisut
Greenland
km
Khmer
km_KH
Khmer
Cambodia
kn
Kannada
kn-IN
Kannada
India
ko
Korean
ko_KR
Korean
Korea
kok
Konkani
kok_IN
Konkani
India
kw
Cornish
kw_GB
Cornish
United Kingdom
lt
Lithuanian
lt_LT
Lithuanian
Lithuania
lv
Latvian
lv_LV
Latvian
Latvia
mk
Macedonian
mk_MK
Macedonian
Macedonia
ml
Malayalam
ml_IN
Malayalam
India
mr
Marathi
mr_IN
Marathi
India
ms
Malay
ms_BN
Malay
Brunei
ms_MY
Malay
Malaysia
mt
Maltese
mt_MT
Maltese
Malta
nb
Norwegian Bokml
nb_NO
Norwegian Bokml
Norway
ne
Nepali
ne_IN
Nepali
India
ne_NP
Nepali
Nepal
nl
Dutch
nl_BE
Dutch
Belgium
nl_NL
Dutch
Netherlands
nn
Norwegian nynorsk
nn_NO
Norwegian nynorsk
Norway
om
Oromo
om_ET
Oromo
Ethiopia
om_KE
Oromo
Kenya
or
Oriya
or_IN
Oriya
India
pa
Punjabi
pa_Arab
Punjabi
Arabic
pa_Arab_PK
Punjabi
Pakistan (Arabic)
pa_Guru
Punjabi
Gurmukhi
pa_Guru_IN
Punjabi
India (Gurmukhi)
pl
Polish
pl_PL
Polish
Poland
ps
Pashto
ps_AF
Pashto
Afghanistan
pt
Portuguese
pt_BR
Portuguese
Brazil
pt_PT
Portuguese
Portugal
ro
Romanian
ro_MD
Romanian
Moldavia
ro_RO
Romanian
Romania
ru
Russian
ru_RU
Russian
Russia
ru_UA
Russian
Ukraine
si
Sinhala
si_LK
Sinhala
Sri Lanka
sk
Slovak
sk_SK
Slovak
Slovakia
sl
Slovenian
sl_SL
Slovenian
Slovenia
so
Somali
so_DJ
Somali
Djibouti
so_ET
Somali
Ethiopia
so_KE
Somali
Kenya
so_SO
Somali
Somalia
sq
Albanian
sq_AL
Albanian
Albania
sr
Serbian
sr_Cyrl
Serbian
Cyrillic
sr_Cyrl_BA
Serbian
Bosnia and Herzegovina (Cyrillic)
sr_Cyrl_ME
Serbian
Montenegro (Cyrillic)
sr_Cyrl_RS
Serbian
Serbia (Cyrillic)
sr_Latn
Serbian
Latin
sr_Latn_BA
Serbian
Bosnia and Herzegovina (Latin)
sr_Latn_ME
Serbian
Montenegro (Latin)
sr_Latn_RS
Serbian
Serbia (Latin)
sv
Swedish
sv_FI
Swedish
Finland
sv_SE
Swedish
Sweden
sw
Swahili
sw_KE
Swahili
Kenya
sw_TZ
Swahili
Tanzania
ta
Tamil
ta_IN
Tamil
India
te
Telugu
te_IN
Telugu
India
th
Thai
th_TH
Thai
Thailand
ti
Tigrinya
ti_ER
Tigrinya
Eritrea
ti_ET
Tigrinya
Ethiopia
tr
Turkish
tr_TR
Turkish
Turkey
uk
Ukrainian
uk_UA
Ukrainian
Ukraine
ur
Urdu
ur_IN
Urdu
India
ur_PK
Urdu
Pakistan
uz
Uzbek
uz_Arab
Uzbek
Arabic
uz_Arab_AF
Uzbek
Afghanistan (Arabic)
uz_Cryl
Uzbek
Cyrillic
uz_Cryl_UZ
Uzbek
Uzbekistan (Cyrillic)
uz_Latin
Uzbek
Latin
us_Latin_UZ
Uzbekistan (Latin)
vi
Vietnamese
vi_VN
Vietnamese
Vietnam
zh
Chinese
zh_Hans
Chinese
Simplified Han
zh_Hans_CN
Chinese
China (Simplified Han)
zh_Hans_HK
Chinese
Hong Kong SAR China (Simplified Han)
zh_Hans_MO
Chinese
Macao SAR China (Simplified Han)
zh_Hans_SG
Chinese
Singapore (Simplified Han)
zh_Hant
Chinese
Traditional Han
zh_Hant_HK
Chinese
Hong Kong SAR China (Traditional Han)
zh_Hant_MO
Chinese
Macao SAR China (Traditional Han)
zh_Hant_TW
Chinese
Taiwan (Traditional Han)
zu
Zulu
zu_ZA
Zulu
South Africa
5 - Locale and UTF-8 support
Vertica supports Unicode Transformation Format-8, or UTF8, where 8 equals 8-bit.
Vertica supports Unicode Transformation Format-8, or UTF8, where 8 equals 8-bit. UTF-8 is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. UTF-8 can represent any universal character in the Unicode standard. Initial encoding of byte codes and character assignments for UTF-8 coincides with ASCII. Thus, UTF8 requires little or no change for software that handles ASCII but preserves other values.
Vertica database servers expect to receive all data in UTF-8, and Vertica outputs all data in UTF-8. The ODBC API operates on data in UCS-2 on Windows systems, and normally UTF-8 on Linux systems. JDBC and ADO.NET APIs operate on data in UTF-16. Client drivers automatically convert data to and from UTF-8 when sending to and receiving data from Vertica using API calls. The drivers do not transform data loaded by executing a COPY or COPY LOCAL statement.
UTF-8 string functions
The following string functions treat VARCHAR arguments as UTF-8 strings (when USING OCTETS is not specified) regardless of locale setting.
Searches string for substring and returns an integer indicating the position of the character in string that is the first character of this occurrence.
Returns an integer value representing the character location of a specified substring within a string (counting from one).
6 - Locale-aware string functions
Vertica provides string functions to support internationalization.
Vertica provides string functions to support internationalization. Unless otherwise specified, these string functions can optionally specify whether VARCHAR arguments should be interpreted as octet (byte) sequences, or as (locale-aware) sequences of characters. Specify this information by adding the parameter USING OCTETS and USING CHARACTERS (default) to the function.
The following table lists all string functions that are locale-aware:
Searches string for substring and returns an integer indicating the position of the character in string that is the first character of this occurrence.