Handling Non-UTF-8 input
Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not directly supported.
If you have data that does not meet the UTF-8 standard, you can modify the data during the load or you can transform the data files before loading.
Checking data format
Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The file
command reports the encoding of any text files. For example:
$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text
The file
command could indicate ASCII text even though the file contains multibyte characters.
To check for multibyte characters in an ASCII file, use the wc
command. For example:
$ wc Date_Dimension.tbl
1828 5484 221822 Date_Dimension.tbl
If the wc
command returns an error such as Invalid or incomplete multibyte or wide character
, the data file is using an incompatible character set.
This example shows two files that are not UTF-8 data files:
$ file data*
data1.txt: Little-endian UTF-16 Unicode text
data2.txt: ISO-8859 text
The results indicate that neither of the files is in UTF-8 format.
Converting data while loading
You can remove or replace non-UTF-8 characters in text data during the load. The MAKEUTF8 function removes such characters by default, or you can specify a replacement string.
The following example shows how to use this function during a load. The original data is loaded into the orig_name
column, and the transformed data is loaded into the name
column. Typically you would use a FILLER column for the original value instead of adding the column to the table definition; this example adds the column to show the differences side by side.
=> CREATE TABLE people (orig_name VARCHAR, name VARCHAR);
CREATE TABLE
=> COPY people (orig_name, name AS MAKEUTF8(orig_name)) FROM ...;
Rows Loaded
-------------
8
(1 row)
=> SELECT * FROM people;
orig_name | name
----------+--------
Dáithí | Dith
Fíona | Fona
Móirín | Mirn
Róisín | Risn
Séamus | Samus
Séan | San
Tiarnán | Tiarnn
Áine | ine
(8 rows)
For general information about transforming data, see Transforming data during loads.
Converting files before loading data
To convert files before loading them into Vertica, use the iconv
UNIX command. For example, to convert the data2.txt
file from the previous example, use the iconv
command as follows:
$ iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt
See the man pages for file
and iconv
for more information.
Checking UTF-8 compliance after loading data
After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named people
that has a VARCHAR column named name
, you can use this statement to verify that all of the strings are UTF-8 encoded:
=> SELECT name FROM people WHERE NOT ISUTF8(name);
If all of the strings are in UTF-8 format, the query should not return any rows.