Data formats

COPY supports many data formats, detailed in the sections that follow. You specify a data format by specifying a parser.

By default, COPY uses the DELIMITED parser (Delimited data) to load raw data into the database. Raw input data must be in UTF-8, delimited text format. Other parsers support other data formats.

The syntax for specifying which parser to use varies. The description of each parser includes this information.

The same COPY statement cannot mix raw data types that require different parsers, such as NATIVE and FIXEDWIDTH. You can, however, load data of different formats, using different parsers, into the same table using separate COPY statements.

For information about verifying input data formats, see Handling Non-UTF-8 input.

All parsers described in this section can be used with conventional tables (those created with CREATE TABLE or CREATE EXTERNAL TABLE). Some also support Flex tables (CREATE FLEX TABLE). For more information specific to Flex tables, see Using flex table parsers.

All parsers support all primitive data types and several support one-dimensional arrays of primitive types. Some parsers support other complex types. See the documentation of individual parsers for information about supported types.

1 - Delimited data

If you do not specify another parser, Vertica defaults to the DELIMITED parser. You can specify the delimiter, escape characters, how to handle null values, and other parameters in the COPY statement.

The following example shows the default behavior, in which the delimiter character is '|':

=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE

=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.

=> SELECT * FROM employees;
 id |      name       |  department
----+-----------------+--------------
 17 | Howard Wolowitz | Astronomy
 42 | Sheldon Cooper  | Physics
(2 rows)

By default, collection values are delimited by brackets and elements are delimited by commas. Collections must be one-dimensional arrays or sets of scalar types.

=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE

=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.

=> SELECT * FROM researchers;
 id |      name       |               grants               |       values
----+-----------------+------------------------------------+---------------------
 17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
 42 | Sheldon Cooper  | ["US-7376","DARPA-1567"]           | [65000,135000]
(2 rows)

To use a special character as a literal, prefix it with an escape character. For example, to include a literal backslash (\) in the loaded data (such as when including a file path), use two backslashes (\\). COPY removes the escape character from the input when it loads escaped characters.

When loading delimited data, two consecutive delimiters indicate a null value, unless the NULL parameter is set otherwise. The final delimiter is optional. For example, the following input is valid for the previous table:


=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 45|Raj|
>> 21|Leonard||
>> \.

=> SELECT * FROM employees;
 id |  name   |  department
----+---------+--------------
 21 | Leonard |
 45 | Raj     |
(2 rows)

By default, if the data has too few values, the load fails. You can use the TRAILING NULLCOLS option to accept any number of missing columns and treat their values as null.

Vertica assumes that data is in the UTF-8 encoding.

The options specific to the DELIMITED parser and their default values are:

Option                  Default
DELIMITER               |
ENCLOSED BY             "
ESCAPE                  \
NULL                    '' (empty string)
COLLECTIONOPEN          [
COLLECTIONCLOSE         ]
COLLECTIONDELIMITER     ,
COLLECTIONNULLELEMENT   null
COLLECTIONENCLOSE       " (double quote)
TRAILING NULLCOLS       (none)

To load delimited data into a Flex table, use the FDELIMITEDPARSER parser.
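
For example, a minimal sketch of loading comma-separated data into a flex table with FDELIMITEDPARSER (the table name and file path are hypothetical):

=> CREATE FLEX TABLE mail_flex();
CREATE TABLE

=> COPY mail_flex FROM '/home/dbadmin/data/mail.csv' PARSER FDELIMITEDPARSER(delimiter=',');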

Changing the column separator (DELIMITER)

The default COPY delimiter is a vertical bar ('|'). The DELIMITER is a single ASCII character used to separate columns within each record of an input source. Between two delimiters, COPY interprets all string data in the input as characters. Do not enclose character strings in quotes, because quote characters are also treated as literals between delimiters.

You can define a different delimiter using any ASCII value in the range E'\000' to E'\177' inclusive. For instance, if you are loading CSV data files, and the files use a comma (',') character as a delimiter, you can change the default delimiter to a comma. You cannot use the same character for both the DELIMITER and NULL options.

If the delimiter character is among a string of data values, use the ESCAPE AS character ('\' by default) to indicate that the delimiter should be treated as a literal.

The COPY statement accepts empty values (two consecutive delimiters) as valid input data for CHAR and VARCHAR data types. COPY stores empty columns as an empty string (''). An empty string is not equivalent to a NULL string.

To indicate a non-printing delimiter character (such as a tab), specify the character in extended string syntax (E'...'). If your database has StandardConformingStrings enabled, use a Unicode string literal (U&'...'). For example, use either E'\t' or U&'\0009' to specify tab as the delimiter.
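
For example, a sketch that loads tab-separated input into the employees table defined earlier:

=> COPY employees FROM STDIN DELIMITER E'\t';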

The following example loads data from a comma-separated file:

=> COPY employees FROM ... DELIMITER ',';

In the following example, the first column has a column-specific delimiter:

=> COPY employees(id DELIMITER ':', name, department) FROM ... DELIMITER ',';

Changing collection delimiters (COLLECTIONDELIMITER, COLLECTIONOPEN, COLLECTIONCLOSE)

The DELIMITER option specifies the value that separates columns in the input. For a column with a collection type (ARRAY or SET), a delimiter is also needed between elements of the collection. In addition, the collection itself has start and end markers. By default, collections are enclosed in brackets and elements are delimited by commas, but you can change these values.

In the following example, collections are enclosed in braces and delimited by periods.

=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672"."darpa-1963"}|{16200.16700}
>> \.

=> SELECT * FROM researchers;
 id |      name       |               grants               |       values
----+-----------------+------------------------------------+---------------------
 17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
 42 | Sheldon Cooper  | ["US-7376","DARPA-1567"]           | [65000,135000]
 19 | Leonard         | ["us-1672","darpa-1963"]           | [16200,16700]
(3 rows)

Changing the character enclosing column or collection values (ENCLOSED BY, COLLECTIONENCLOSE)

The ENCLOSED BY parameter lets you set an ASCII character to delimit characters to embed in string values. The enclosing character is not considered to be part of the data if and only if it is the first and last character of the input. You can use any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000') for the ENCLOSED BY value. Using double quotation marks (") is common, as shown in the following example.

=> COPY employees FROM STDIN ENCLOSED BY '"';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 21|Leonard|Physics
>> 42|"Sheldon"|"Physics"
>> 17|Rajesh "Raj" K|Astronomy
>> \.

=> SELECT * FROM employees;
 id |     name       |  department
----+----------------+--------------
 17 | Rajesh "Raj" K | Astronomy
 21 | Leonard        | Physics
 42 | Sheldon        | Physics
(3 rows)

Notice that while ENCLOSED BY is a double quote, the embedded quotes in Rajesh's name are treated as part of the data because they are not the first and last characters in the column. The quotes that enclose "Sheldon" and "Physics" are dropped because of their positions.

Within a collection value, the COLLECTIONENCLOSE parameter is like ENCLOSED BY for individual elements of the collection.
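
For example, a sketch that loads collection elements enclosed in single quotation marks into the researchers table (the grant values shown are hypothetical):

=> COPY researchers FROM STDIN COLLECTIONENCLOSE '''';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 21|Leonard|['US-4567','NSF-8987']|[50000,25000]
>> \.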

Changing the null indicator (NULL)

By default, an empty string ('') for a column value means NULL. You can specify a different ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NUL: E'\000') as the NULL indicator. You cannot use the same character for both the DELIMITER and NULL options.

A column containing one or more whitespace characters is not NULL unless the sequence of whitespace exactly matches the NULL string.

A NULL is case-insensitive and must be the only value between the data field delimiters. For example, if the null string is NULL and the delimiter is the default vertical bar (|):

|NULL| indicates a null value.

| NULL | does not indicate a null value.

When you use the COPY statement in a script, you must substitute a double-backslash for each null string that includes a backslash. For example, the scripts used to load the example database contain:

COPY ... NULL E'\\n' ...

Changing the null indicator for collection values (COLLECTIONNULLELEMENT)

The NULL option specifies the value to be treated as null for a column value. For a column with a collection type (ARRAY or SET), a separate option specifies how to interpret null elements. By default, "null" indicates a null value. An empty value, meaning two consecutive element delimiters, does not indicate null:

=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 17|Howard|["nasa-143",,"nasa-6262"]|[10000,1650,15367]
>> 19|Leonard|["us-177",null,"us-6327"]|[16200,64000,26500]
>> \.

=> SELECT * FROM researchers;
 id |  name   |           grants            |       values
----+---------+-----------------------------+---------------------
 17 | Howard  | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
 19 | Leonard | ["us-177",null,"us-6327"]   | [16200,64000,26500]
(2 rows)

Use COLLECTIONNULLELEMENT to specify a different value, as in the following example.

=> COPY researchers from STDIN COLLECTIONNULLELEMENT 'x';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon|[x,"us-1672"]|[x,165000]
>> \.

=> SELECT * FROM researchers;
 id |  name   |           grants            |       values
----+---------+-----------------------------+---------------------
 17 | Howard  | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
 19 | Leonard | ["us-177",null,"us-6327"]   | [16200,64000,26500]
 42 | Sheldon | [null, "us-1672"]           | [null,165000]
(3 rows)

Filling missing columns (TRAILING NULLCOLS)

By default, COPY fails if the input does not contain enough columns. Use the TRAILING NULLCOLS option to instead insert NULL values for any columns that lack data. This option cannot be used with columns that have a NOT NULL constraint.

The following example demonstrates use of this option.

=> CREATE TABLE z (a INT, b INT, c INT );

--- insert with enough data:
=> INSERT INTO z VALUES (1, 2, 3);

=> SELECT * FROM z;
 a | b | c
---+---+---
 1 | 2 | 3
(1 row)

--- insert deficient data:
=> COPY z FROM STDIN TRAILING NULLCOLS;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 4 | 5 | 6
>> 7 | 8
>> \.

=> SELECT * FROM z;
 a | b | c
---+---+---
 1 | 2 | 3
 4 | 5 | 6
 7 | 8 |
(3 rows)

Changing the escape character (ESCAPE AS, NO ESCAPE)

You can specify an escape character, which enables any special characters to be treated as part of the data. For example, if an element from a CSV file should contain a comma, you can indicate that by pre-pending the escape character to the comma in the data. The default escape character is a backslash (\).

To change the default to a different character, use the ESCAPE AS option. You can set the escape character to be any ASCII value in the range E'\001' to E'\177' inclusive.

If you do not want any escape character and want to prevent any characters from being interpreted as escape sequences, use the NO ESCAPE option.

ESCAPE AS and NO ESCAPE can be set at both the column and global levels.
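
For example, minimal sketches that change the escape character to a caret and that disable escaping entirely (both assume the employees table from earlier):

=> COPY employees FROM STDIN ESCAPE AS '^';

=> COPY employees FROM STDIN NO ESCAPE;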

Changing the end-of-line character (RECORD TERMINATOR)

To specify the literal character string that indicates the end of a data file record, use the RECORD TERMINATOR parameter, followed by the string to use. If you do not specify a value, then Vertica attempts to determine the correct line ending, accepting either just a linefeed (E'\n') common on UNIX systems, or a carriage return and linefeed (E'\r\n') common on Windows platforms.

For example, if your file contains comma-separated values terminated by line feeds that you want to maintain, use the RECORD TERMINATOR option to specify an alternative value:

=>  COPY mytable FROM STDIN DELIMITER ',' RECORD TERMINATOR E'\n';

To specify the RECORD TERMINATOR as non-printing characters, use either the extended string syntax or Unicode string literals. The following table lists some common record terminator characters. See String Literals for an explanation of the literal string formats.

Extended String Syntax   Unicode Literal String   Description       ASCII Decimal
E'\b'                    U&'\0008'                Backspace         8
E'\t'                    U&'\0009'                Horizontal tab    9
E'\n'                    U&'\000a'                Linefeed          10
E'\f'                    U&'\000c'                Formfeed          12
E'\r'                    U&'\000d'                Carriage return   13
E'\\'                    U&'\005c'                Backslash         92

If you use the RECORD TERMINATOR option to specify a custom value, be sure the input file matches the value. Otherwise, you may get inconsistent data loads.

If using JDBC, Vertica recommends that you use the following value for RECORD TERMINATOR:

System.getProperty("line.separator")

2 - Binary (native) data

You can load binary data using the NATIVE parser option, except with COPY LOCAL, which does not support this option. Since binary-format data does not require the use and processing of delimiters, it precludes the need to convert integers, dates, and timestamps from text to their native storage format, and improves load performance over delimited data. All binary-format files must adhere to the formatting specifications described in Appendix: creating native binary format files.

Native binary format data files are typically larger than their delimited text format counterparts, so compress the data before loading it. The NATIVE parser does not support concatenated compressed binary files. You can load native (binary) format files when developing plug-ins to ETL applications.

There is no copy format to load binary data byte-for-byte because the column and record separators in the data would have to be escaped. Binary data type values are padded and translated on input, and also in the functions, operators, and casts supported.

Loading hexadecimal, octal, and bitstring data

You can use the formats hexadecimal, octal, and bitstring only to load binary columns. To specify these column formats, use the COPY statement's FORMAT options:

  • Hexadecimal

  • Octal

  • Bitstring

The following examples illustrate how to use the FORMAT option.

  1. Create a table:

    => CREATE TABLE t(oct VARBINARY(5),
         hex VARBINARY(5),
         bitstring VARBINARY(5) );
    
  2. Create the projection:

    => CREATE PROJECTION t_p(oct, hex, bitstring) AS SELECT * FROM t;
    
  3. Use a COPY statement with the STDIN clause, specifying each of the formats:

    => COPY t (oct FORMAT 'octal', hex FORMAT 'hex',
               bitstring FORMAT 'bitstring')
       FROM STDIN DELIMITER ',';
    
  4. Enter the data to load, ending the statement with a backslash (\) and a period (.) on a separate line:

    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 141142143144145,0x6162636465,0110000101100010011000110110010001100101
    >> \.
    
  5. Use a select query on table t to view the input values results:

    => SELECT * FROM t;
     oct   | hex   | bitstring
    -------+-------+-----------
    abcde  | abcde | abcde
    (1 row)
    

COPY uses the same default format to load binary data, as used to input binary data. Since the backslash character ('\') is the default escape character, you must escape octal input values. For example, enter the byte '\141' as '\\141'.

On input, COPY translates string data as follows:

  • Uses the HEX_TO_BINARY function to translate from hexadecimal representation to binary.

  • Uses the BITSTRING_TO_BINARY function to translate from bitstring representation to binary.

Both functions take a VARCHAR argument and return a VARBINARY value.
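
For example, a sketch showing both functions converting the same two bytes (0x61 0x62, the characters 'ab'):

=> SELECT HEX_TO_BINARY('0x6162') AS h, BITSTRING_TO_BINARY('0110000101100010') AS b;
 h  | b
----+----
 ab | ab
(1 row)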

You can also use the escape character to represent the (decimal) byte 92 by escaping it twice; for example, '\\\\'. Note that vsql inputs the escaped backslash as four backslashes. Equivalent inputs are hex value '0x5c' and octal value '\134' (134 = 1 x 8^2 + 3 x 8^1 + 4 x 8^0 = 92).

You can load a delimiter value if you escape it with a backslash. For example, given delimiter '|', '\\001\|\\002' is loaded as {1,124,2}, which can also be represented in octal format as '\\001\\174\\002'.

If you insert a value with more bytes than fit into the target column, COPY returns an error. For example, if column c1 is VARBINARY(1):

=> INSERT INTO t (c1) values ('ab');
ERROR: 2-byte value too long for type Varbinary(1)

If you implicitly or explicitly cast a value with more bytes than fit the target data type, COPY silently truncates the data. For example:

=> SELECT 'abcd'::binary(2);
 binary
--------
 ab
(1 row)

Hexadecimal data

The optional '0x' prefix indicates that a value is hexadecimal, not decimal, although not all hexadecimal values use A-F; for example, 5396. COPY ignores the 0x prefix when loading the input data.

If there are an odd number of characters in the hexadecimal value, the first character is treated as the low nibble of the first (furthest to the left) byte.

Octal data

Loading octal format data requires that each byte be represented by a three-digit octal code. The first digit must be in the range [0,3] and the second and third digits must both be in the range [0,7].

If the length of an octal value is not a multiple of three, or if one of the three digits is not in the proper range, the value is invalid and COPY rejects the row in which the value appears. If you supply an invalid octal value, COPY returns an error. For example:

=> SELECT '\\000\\387'::binary(8);
ERROR: invalid input syntax for type binary

Rows that contain binary values with invalid octal representations are also rejected. For example, COPY rejects '\\008' because '\\008' is not a valid octal number.

BitString data

Loading bitstring data requires that each character must be zero (0) or one (1), in multiples of eight characters. If the bitstring value is not a multiple of eight characters, COPY treats the first n characters as the low bits of the first byte (furthest to the left), where n is the remainder of the value's length, divided by eight.

Examples

The following example shows VARBINARY HEX_TO_BINARY(VARCHAR) and VARCHAR TO_HEX(VARBINARY) usage.

  1. Create table t and its projection with a binary column:

    => CREATE TABLE t (c BINARY(1));
    => CREATE PROJECTION t_p (c) AS SELECT c FROM t;
    
  2. Insert minimum and maximum byte values, including an IP address represented as a character string:

    => INSERT INTO t values(HEX_TO_BINARY('0x00'));
    => INSERT INTO t values(HEX_TO_BINARY('0xFF'));
    => INSERT INTO t values (V6_ATON('2001:DB8::8:800:200C:417A'));
    

    Use the TO_HEX function to format binary values in hexadecimal on output:

    => SELECT TO_HEX(c) FROM t;
     to_hex
    --------
     00
     ff
     20
    (3 rows)
    

3 - Native varchar data

Use the NATIVE VARCHAR parser option when the raw data consists primarily of CHAR or VARCHAR data. COPY performs the conversion to the actual table data types on the database server. This parser option is not supported with COPY LOCAL.

Using NATIVE VARCHAR does not provide the same efficiency as NATIVE. However, NATIVE VARCHAR precludes the need to use delimiters or to escape special characters, such as quotes, which can make working with client applications easier.

Batch data inserts performed through the Vertica ODBC and JDBC drivers automatically use the NATIVE VARCHAR format.
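
For example, a sketch of loading a file that is already in native varchar format (the file path is hypothetical):

=> COPY employees FROM '/home/dbadmin/data/employees.dat' NATIVE VARCHAR;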

4 - Fixed-width format data

Use the FIXEDWIDTH parser option to bulk load fixed-width data. You must specify the COLSIZES option values to specify the number of bytes for each column. The definition of the table you are loading (COPY table f (x, y, z)) determines the number of COLSIZES values to declare.

To load fixed-width data, use the COLSIZES option to specify the number of bytes for each input column. If any records do not have values, COPY inserts one or more null characters to equal the specified number of bytes. The last record in a fixed-width data file must include a record terminator to determine the end of the load data.

The following COPY options are not supported:

  • DELIMITER

  • ENCLOSED BY

  • ESCAPE AS

  • TRAILING NULLCOLS

Using nulls in fixed-width data

The default NULL string for a fixed-width load cannot be an empty string, and instead, consists of all spaces. The number of spaces depends on the column width declared with the COLSIZES (integer [,...]) option.

For fixed-width loads, the NULL definition depends on whether you specify NULL at the column or statement level:

  • Statement level: NULL must be defined as a single character. The default (or custom) NULL character is repeated for the entire width of the column.

  • Column level: NULL must be defined as a string whose length matches the column width.

For fixed-width loads, if the input data column has fewer values than the specified column size, COPY inserts NULL characters. The number of NULLs must match the declared column width. If you specify a NULL string at the column level, COPY matches the string with the column width.

Defining a null character (statement level)

  1. Create a two-column table (fw):

    => CREATE TABLE fw(co int, ci int);
    CREATE TABLE
    
  2. Copy the table, specifying null as 'N', and enter some data:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) null AS 'N' NO COMMIT;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> NN12
    >> 23NN
    >> NNNN
    >> nnnn
    >> \.
    
  3. Select all (*) from the table:

    => SELECT * FROM fw;
     co | ci
    ----+----
        | 12
     23 |
    (2 rows)
    

Defining a custom record terminator

To define a record terminator other than the COPY default when loading fixed-width data, take these steps:

  1. Create table fw with two columns, co and ci:

    => CREATE TABLE fw(co int, ci int);
    CREATE TABLE
    
  2. Copy table fw, specifying two 2-byte column sizes, and specifying a comma (,) as the record terminator:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) RECORD TERMINATOR ',';
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1234,1444,6666
    >> \.
    
  3. Query all data in table fw:

    => SELECT * FROM fw;
     co | ci
    ----+----
     12 | 34
     14 | 44
    (2 rows)
    

The SELECT output indicates only two values. COPY rejected the third value (6666) because it was not followed by a comma (,) record terminator. Fixed-width data requires a trailing record terminator only if you explicitly specify a record terminator.

Copying fixed-width data

Use COPY FIXEDWIDTH COLSIZES (n [,...]) to load files into a Vertica database. By default, all spaces are NULLs. For example:

=> CREATE TABLE mytest(co int, ci int);
=> CREATE PROJECTION mytest_p1 AS SELECT * FROM mytest SEGMENTED BY HASH(co) ALL NODES;
=> COPY mytest(co,ci) FROM STDIN FIXEDWIDTH colsizes(6,4) NO COMMIT;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> \.
=> SELECT * FROM mytest ORDER BY co;
 co | ci
----+----
(0 rows)

Skipping content in fixed-width data

The COPY statement has two options to skip input data. The SKIP BYTES option is only for fixed-width data loads:

SKIP BYTES num-bytes   Skips the specified number of bytes from the input data.
SKIP num-records       Skips the specified number of records.

The following example uses SKIP BYTES to skip 11 bytes when loading a fixed-width table with two columns (4 and 6 bytes):

  1. Copy a table using SKIP BYTES:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP BYTES 11;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 2222666666
    >> 1111999999
    >> 1632641282
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
      co  |   ci
    ------+--------
     1111 | 999999
     1632 | 641282
    (2 rows)
    

The output confirms that COPY skipped the first 11 bytes of loaded data.

The following example uses SKIP when loading a fixed-width (4,6) table:

  1. Copy a table, using SKIP to skip two records of input data:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP 2;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 2222666666
    >> 1111999999
    >> 1632641282
    >> 3333888888
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
      co  |   ci
    ------+--------
     1632 | 641282
     3333 | 888888
    (2 rows)
    

The output confirms that COPY skipped the first two records of load data.

Trimming characters in fixed-width data loads

Use the TRIM option to trim a character. TRIM accepts a single-byte character, which is trimmed at the beginning and end of the data. For fixed-width data loads, when you specify a TRIM character, COPY first checks to see if the row is NULL. If the row is not null, COPY trims the character(s). The next example instructs COPY to trim the character A, and shows the results:

  1. Copy table fw, specifying TRIM character A:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(4,6) TRIM 'A';
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> A22A444444
    >> A22AA4444A
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
     co |   ci
    ----+--------
     22 |   4444
     22 | 444444
    (2 rows)
    

Using padding in fixed-width data loads

By default, the padding character is ' ' (a single space). The padding behavior for fixed-width data loads is similar to how a space is treated in other formats, differing by data type as follows:

Data type                                  Padding
Integer                                    Leading and trailing spaces
Bool                                       Leading and trailing spaces
Float                                      Leading and trailing spaces
[var]Binary                                None; all characters are significant.
[Var]Char                                  Trailing spaces if the string is too large
Date, Interval, Time, Timestamp, TimeTZ    None; all characters are significant. COPY uses an internal algorithm to parse these data types.
Date (formatted)                           Use the COPY FORMAT option string to match the expected column length.
Numerics                                   Leading and trailing spaces
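
For example, a sketch of loading a formatted date column in fixed-width data (the table and column sizes are hypothetical):

=> CREATE TABLE fw_dates(d DATE, qty INT);
CREATE TABLE

=> COPY fw_dates(d FORMAT 'YYYYMMDD', qty) FROM STDIN FIXEDWIDTH colsizes(8,4);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 202301154250
>> \.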

5 - ORC data

The ORC (Optimized Row Columnar) format is a column-oriented file format. Vertica has a parser for this format that can take advantage of the columnar layout.

In the COPY statement, specify the parser as follows. Note that this is not the usual PARSER parser-name syntax; COPY supports ORC directly:

=> COPY tableName FROM path ORC[(...)];

The parser takes several optional parameters; see the ORC reference page.

Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Schema matching

By default, the ORC parser uses strong schema matching. This means that the load must consume all columns in the ORC data in the order they occur in the data. You can, instead, use loose schema matching, which allows you to select the columns you want and ignore the rest. (You must specify all struct fields.) Loose schema matching depends on the names of the columns in the data rather than their order, so the names in your table must match those in the data. Types must match or be coercible. For more information on how to use loose schema matching, see Loose Schema Matching on the ORC reference page.

Supported data types

Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.

Timestamps and time zones

To correctly report timestamps, Vertica must know what time zone the data was written in. Older versions of the ORC format do not record the time zone.

Hive version 1.2.0 and later records the writer time zone in the stripe footer. Vertica uses that time zone to make sure the timestamp values read into the database match the ones written in the source file. For ORC files that are missing this time zone information, Vertica assumes the values were written in the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table. Check for events of this type after your first query to verify that timestamps are being handled as you expected.
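
For example, a sketch of such a check against the QUERY_EVENTS system table:

=> SELECT event_type, event_description FROM query_events
   WHERE event_type = 'ORC_FILE_INFO';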

6 - Parquet data

Parquet is a column-oriented file format. Vertica has a parser that can take advantage of the columnar layout.

In the COPY statement, specify the parser as follows. Note that this is not the usual PARSER parser-name syntax; COPY supports Parquet directly:


=> COPY tableName FROM path PARQUET[(...)];

The parser takes several optional parameters; see the PARQUET reference page.

Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Schema matching

By default, the Parquet parser uses strong schema matching. This means that the load must consume all columns in the Parquet data in the order they occur in the data. You can, instead, use loose schema matching, which allows you to select the columns and struct fields you want and ignore the rest. Loose schema matching depends on the names of the columns and fields in the data rather than their order, so the names in your table must match those in the data. Types must match or be coercible. For more information on how to use loose schema matching, see Loose Schema Matching on the PARQUET reference page.

Use the do_soft_schema_match_by_name parameter to specify loose schema matching. In the following example, the Parquet data contains more columns than those used in the table.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(do_soft_schema_match_by_name='True');

=> SELECT * FROM restaurants;
       name        | cuisine
-------------------+----------
 Bob's pizzeria    | Italian
 Bakersfield Tacos | Mexican
(2 rows)

Metadata caching

Parquet files include metadata that Vertica uses when loading data. To avoid repeatedly fetching this data, particularly from remote sources or where API calls incur financial costs, Vertica caches this metadata on each participating node during the planning phase for use during the execution phase.

Vertica uses TEMP storage for the cache, and only if TEMP storage is on the local file system.

You can limit the size of the cache by setting the ParquetMetadataCacheSizeMB configuration parameter. The default is 4GB.
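
For example, a sketch that lowers the cache limit to 2 GB (the value is arbitrary):

=> ALTER DATABASE DEFAULT SET PARAMETER ParquetMetadataCacheSizeMB = 2048;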

Supported data types

Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.

For the Parquet format only, you can use flexible complex types instead of fully specifying the schema for complex types. See Flexible complex types.

Timestamps and time zones

To correctly report timestamps, Vertica must know what time zone the data was written in. Hive does not record the writer time zone. Vertica assumes timestamp values were written in the local time zone and reports a warning at query time.

Hive provides an option, when writing Parquet files, to record timestamps in the local time zone. If you are using Parquet files that record times in this way, set the UseLocalTzForParquetTimestampConversion configuration parameter to 0 to disable the conversion done by Vertica. (See General parameters.)
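
For example, a sketch that disables the conversion:

=> ALTER DATABASE DEFAULT SET PARAMETER UseLocalTzForParquetTimestampConversion = 0;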

7 - JSON data

Use FJSONPARSER to load data in JSON format.

The schema for JSON data is the set of property names in the property/value pairs. When you load JSON data into a columnar table or materialized columns in a Flex table, the property names in the data must match the column names in the table. You do not need to load all of the columns in the data.

The JSON parser can load data into columns of any scalar type, strongly-typed complex type, or flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.

In the COPY statement, use the PARSER parameter to specify the JSON parser as in the following example:

=> CREATE EXTERNAL TABLE customers(id INT, address VARCHAR, transactions ARRAY[INT,10])
    AS COPY FROM 'cust.json' PARSER FJSONPARSER();

This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.

Before loading JSON data, consider using a tool such as JSONLint to verify that the data is valid.

If you load JSON data into a Flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data. You can use Flex functions to extract values.
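
For example, a minimal sketch of loading a JSON file into a flex table and inspecting the VMap column (the table name and file path are hypothetical):

=> CREATE FLEX TABLE orders();
CREATE TABLE

=> COPY orders FROM '/home/dbadmin/data/orders.json' PARSER FJSONPARSER();

=> SELECT MAPTOSTRING(__raw__) FROM orders;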

Strongly-typed complex types

JSON data can contain arrays, structs, and combinations of the two. You can load this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.

Use the ARRAY and ROW types in the table definition as usual:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city ARRAY[VARCHAR(80),50],
   menu ARRAY[ ROW(item VARCHAR(80), price FLOAT), 100 ]
  )
 AS COPY FROM :restdata PARSER FJSONPARSER();

=> SELECT name, location_city, menu FROM rest;
       name        |       location_city        |                                     menu
-------------------+----------------------------+------------------------------------------------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | [{"item":"cheese pizza","price":8.25},{"item":"spinach pizza","price":10.5}]
 Bakersfield Tacos | ["Pittsburgh"]             | [{"item":"veggie taco","price":9.95},{"item":"steak taco","price":10.95}]
(2 rows)

When loading JSON data into a table with strong typing for complex types, Vertica ignores the parser's flatten_maps and flatten_arrays parameters.

Strong and flexible complex types

An advantage of strong typing is the easier (and more efficient) access in queries. A disadvantage is that additional values found in the data but not included in the column definition are ignored. If the menu struct in this data includes more attributes, such as calories, they are not loaded because the definition of the column only specified item and price. The following example uses flexible complex types to reveal the extra attributes:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city LONG VARBINARY, menu LONG VARBINARY)
 AS COPY FROM :restdata
 PARSER FJSONPARSER(flatten_maps=false);

=> SELECT name, MAPTOSTRING(location_city) as location_city, MAPTOSTRING(menu) AS menu FROM rest;
       name        |       location_city      |                     menu
-------------------+--------------------------+---------------------------------------------------
 Bob's pizzeria    | {
    "0": "Cambridge",
    "1": "Pittsburgh"
} | {
    "0": {
        "calories": "1200",
        "item": "cheese pizza",
        "price": "8.25"
    },
    "1": {
        "calories": "900",
        "item": "spinach pizza",
        "price": "10.50"
    }
}
 Bakersfield Tacos | {
    "0": "Pittsburgh"
}                    | {
    "0": {
        "item": "veggie taco",
        "price": "9.95",
        "vegetarian": "true"
    },
    "1": {
        "item": "steak taco",
        "price": "10.95"
    }
}
(2 rows)

Unmatched fields

The JSON parser loads all fields from the data that are part of the table definition. If the data contains other fields, the parser produces a warning and logs the new fields in a system table. You can review the logged events and decide whether to modify your table definition. For details, see Schema evolution.

Loading from a specific start point

You need not load an entire JSON file. You can use the start_point parameter to load data beginning at a specific key, rather than at the beginning of a file. Data is parsed from after the start_point key until the end of the file, or to the end of the first start_point's value. The parser ignores any subsequent instance of the start_point, even if that key appears multiple times in the input file. If the input data contains only one copy of the start_point key, and that value is a list of JSON elements, the parser loads each element in the list as a row.

If a start_point value occurs more than once in your JSON data, you can use the start_point_occurrence integer parameter to specify the occurrence at which to start parsing.

This example uses the following JSON data, saved to a file named alphanums.json:

 { "A": { "B": { "C": [ { "d": 1, "e": 2, "f": 3 }, { "g": 4, "h": 5, "i": 6 },
{ "j": 7, "k": 8, "l": 9 } ] } } }

Loading this data into a flex table produces the following results:

=> CREATE FLEX TABLE start_json;
CREATE TABLE

=> COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           1
(1 row)

=> SELECT maptostring(__raw__) FROM start_json;
                        maptostring
-------------------------------------------------------------------------------
 {
   "A.B.C" : {
      "0.d" : "1",
      "0.e" : "2",
      "0.f" : "3",
      "1.g" : "4",
      "1.h" : "5",
      "1.i" : "6",
      "2.j" : "7",
      "2.k" : "8",
      "2.l" : "9"
   }
}

(1 row)

The following load specifies a start point:

=> TRUNCATE TABLE start_json;
TRUNCATE TABLE

=>  COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER(start_point='B');
 Rows Loaded
-------------
           1
(1 row)

=> SELECT maptostring(__raw__) FROM start_json;
                         maptostring
--------------------------------------------------------------------------------
 {
   "C" : {
      "0.d" : "1",
      "0.e" : "2",
      "0.f" : "3",
      "1.g" : "4",
      "1.h" : "5",
      "1.i" : "6",
      "2.j" : "7",
      "2.k" : "8",
      "2.l" : "9"
   }
}
(1 row)
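
If the start_point key appeared more than once in the data, you could begin parsing at a later occurrence with start_point_occurrence. A sketch (the occurrence number is hypothetical):

=> COPY start_json FROM '/home/dbadmin/data/alphanums.json'
   PARSER FJSONPARSER(start_point='B', start_point_occurrence=2);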

Dealing with invalid JSON records

If your JSON data contains syntax errors, your load can fail due to invalid records. You can use the RECORD_TERMINATOR option in the COPY statement to skip these invalid records if your JSON records are consistently delimited by a character like a line break. Setting a record terminator allows the parser to skip over invalid records and continue parsing the rest of the data.

If your records are not consistently marked by a character, you can use the ERROR TOLERANCE option. ERROR TOLERANCE skips entire source files with invalid JSON records, while RECORD_TERMINATOR skips individual malformed JSON records. You can use the two options together.

The following example uses invalid records:

=> CREATE FLEX TABLE fruits();
CREATE TABLE

=> COPY fruits FROM STDIN PARSER FJSONPARSER(RECORD_TERMINATOR=E'\n');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "type": "fruit", "color": "orange", "rating": 5 }
>> {"name": "apple", "type": "fruit", "color": "green" }
>> {"name": "blueberry", "type": "fruit", "color": "blue", "rating": 10 }
>> "type": "fruit", "rating": 7 }
>> {"name": "banana", "type" : "fruit", "color": "yellow", "rating": 3 }
>> \.

View the flex table using MAPTOSTRING to confirm that the invalid record was skipped while the rest of the records were successfully loaded:

=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
--------------------------------------------------------------------------------------------
{
"color" : "orange",
"name" : "orange",
"rating" : "5",
"type" : "fruit"
}
{
"color" : "green",
"name" : "apple",
"type" : "fruit"
}
{
"color" : "blue",
"name" : "blueberry",
"rating" : "10",
"type" : "fruit"
}
{
"color" : "yellow",
"name" : "banana",
"rating" : "3",
"type" : "fruit"
}
(4 rows)

Rejecting data on materialized column type errors

By default, if FJSONPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types; if any ROW field or ARRAY element cannot be coerced, it loads NULL for the column.

The following example attempts to load invalid data. Note that the invalid row is missing from the query results:

=> CREATE TABLE test(one VARCHAR, two INT);
CREATE TABLE

=> COPY test FROM stdin
   PARSER FJSONPARSER(reject_on_materialized_type_error=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"one": 1, "two": 2}
>> {"one": "one", "two": "two"}
>> {"one": "one", "two": 2}
>> \.

=> SELECT one, two FROM test;
 one | two
-----+-----
 1   |   2
 one |   2
(2 rows)

Rejecting or omitting empty keys in flex tables

Valid JSON files can include empty key and value pairs. By default, for a Flex table, FJSONPARSER loads them, as in the following example:

=> CREATE FLEX TABLE fruits();
CREATE TABLE

=> COPY fruits FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "rating": 5}
>> {"name": "apple", "rating" : 10}
>> {"": "banana", "rating" : 2}
>> \.

=> SELECT MAPTOSTRING(__raw__) FROM fruits;
              maptostring
---------------------------------------
 {
        "name": "orange",
        "rating": "5"
}
 {
        "name": "apple",
        "rating": "10"
}
 {
        "": "banana",
        "rating": "2"
}
(3 rows)

To omit fields with empty keys, use the omit_empty_keys parameter:

=> COPY fruits FROM STDIN PARSER FJSONPARSER(omit_empty_keys=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "apple", "rating": 5}
>> {"": "missing", "rating" : 1}
>> {"name": "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
             maptostring
--------------------------------------
 {
        "name": "apple",
        "rating": "5"
}
 {
        "rating": "1"
}
 {
        "name": "",
        "rating": "3"
}
(3 rows)

Note that the second value, with the missing name, still loads with the other (non-empty) field. To instead reject the row entirely, use the reject_on_empty_key parameter:

=> COPY fruits FROM STDIN PARSER FJSONPARSER(reject_on_empty_key=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name" : "apple", "rating" : 5}
>> {"" : "missing", "rating" : 1}
>> {"name" : "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
             maptostring
--------------------------------------
 {
        "name": "apple",
        "rating": "5"
}
 {
        "name": "",
        "rating": "3"
}
(2 rows)

8 - Avro data

Use FAVROPARSER to load Avro data files. This parser supports both columnar and Flex tables.

A column can be of any scalar type, a strongly-typed complex type, or a flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.

The following requirements apply:

  • Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.

  • The Avro file must have its related schema in the file being loaded; FAVROPARSER does not support Avro files with separate schema files.

In the COPY statement, use the PARSER parameter to specify the Avro parser as in the following example:

=> COPY weather FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();

This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.

Avro schemas and columnar tables

Avro includes the schema with the data. When you load Avro data into a columnar table, the column names in the schema in the data must match the column names in the table. You do not need to load all of the columns in the data.

For example, the following Avro schema uses the Avro record type to represent a user profile:

{
  "type": "record",
  "name": "Profile",
  "fields" : [
      {"name": "UserName", "type": "string"},
      {"name": "Email", "type": "string"},
      {"name": "Address", "type": "string"}
   ]
}

To successfully load the data into a columnar table with this schema, each target column name must match the "name" value in the schema. In the following example, the profiles table does not load values corresponding to the schema's Email field because the target column is named EmailAddr:

=> COPY profiles FROM '/home/dbadmin/data/user_profile.avro' PARSER FAVROPARSER();

=> SELECT * FROM profiles;
     UserName    |     EmailAddr      |       Address
-----------------+--------------------+---------------------
    dbadmin      |                    |    123 Main St.

Strongly-typed complex types

Avro data can contain arrays, structs, and combinations of the two. You can read this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.

Use the ARRAY and ROW types in the table definition as usual:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city ARRAY[VARCHAR(80)],
   menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
  )
AS COPY FROM :avro_file PARSER FAVROPARSER();

You can use strong typing in both external tables and native tables.

An alternative to strong typing for complex types is flexible complex types.
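
For example, a sketch of the same table defined with flexible complex columns instead (assuming the same :avro_file and the parser's flatten_maps parameter):

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city LONG VARBINARY, menu LONG VARBINARY)
AS COPY FROM :avro_file PARSER FAVROPARSER(flatten_maps=false);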

Unmatched fields

The Avro parser loads all fields from the data that are part of the table definition. If the data contains other fields, the parser produces a warning and logs the new fields in a system table. You can review the logged events and decide whether to modify your table definition. For details, see Schema evolution.

Rejecting data on materialized column type errors

By default, if FAVROPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.

If a flex table has a materialized column, the data being loaded must be coercible to that column's type. For example, if a materialized column is declared as a FLOAT and you try to load a VARCHAR value for that key, FAVROPARSER rejects the data row.

9 - Matches from regular expressions

You can load flex or columnar tables with the matched results of a regular expression, using the fregexparser. This section describes some examples of using the options that the flex parsers support.

Sample regular expression

These examples use the following regular expression, which searches for information that includes the timestamp, date, thread_name, and thread_id strings.

The expression loads any thread_id hex value, regardless of whether it has a 0x prefix: (?<thread_id>(?:0x)?[0-9a-f]+).

'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
 (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'

Using regular expression matches for a flex table

You can load the results from a regular expression into a flex table, using the fregexparser. For a complete example of doing so, see FREGEXPARSER.

Using fregexparser for columnar tables

This section illustrates how to load the results of a regular expression used with a sample log file for a Vertica database. By using an external table definition, the example shows how to use fregexparser to load data into a columnar table. Using a flex table parser for columnar tables gives you the capability to mix data loads in one table. For example, you can load the results of a regular expression in one session, and JSON data in another.

The following basic examples illustrate this usage.

  1. Create a columnar table, vlog, with the following columns:

    => CREATE TABLE vlog (
        "text"             varchar(2322),
        thread_id         varchar(28),
        thread_name     varchar(44),
        "time"             varchar(46),
        component         varchar(30),
        level             varchar(20),
        transaction_id     varchar(32),
        elevel             varchar(20),
        enode             varchar(34)
    );
    
  2. Use COPY to load parts of a log file using the sample regular expression presented above, with the fregexparser. Be sure to remove any line break characters from this expression example before trying it yourself:

    => COPY vlog FROM '/home/dbadmin/data/flex/vertica.log' PARSER
    FRegexParser(pattern=
    '^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
     (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
    -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
    \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
    ?(?<text>.*)'
    
    ) rejected data as table fregex_reject;
  3. Query the time column:

    => SELECT time FROM vlog limit 10;
              time
    -------------------------
     2014-04-02 04:02:02.613
     2014-04-02 04:02:02.613
     2014-04-02 04:02:02.614
     2014-04-02 04:02:51.008
     2014-04-02 04:02:51.010
     2014-04-02 04:02:51.012
     2014-04-02 04:02:51.012
     2014-04-02 04:02:51.013
     2014-04-02 04:02:51.014
     2014-04-02 04:02:51.017
    (10 rows)
    

Using external tables with fregexparser

By creating an external columnar table for your Vertica log file, you can query the table to return updated log information. The following basic example illustrates this usage.

  1. Create a columnar table, vertica_log, using the AS COPY clause and fregexparser to load matched results from the regular expression. For illustrative purposes, this regular expression has new line characters to split long text lines. Remove any line returns before testing with this expression:

    => CREATE EXTERNAL TABLE public.vertica_log
    (
        "text" varchar(2322),
        thread_id varchar(28),
        thread_name varchar(44),
        "time" varchar(46),
        component varchar(30),
        level varchar(20),
        transaction_id varchar(32),
        elevel varchar(20),
        enode varchar(34)
    )
    AS COPY
    FROM '/home/dbadmin/data/vertica.log'
    PARSER FRegexParser(pattern=
    '^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
     (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
    -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
    \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
    ?(?<text>.*)'
    
    );
  2. Query from the external table to get updated results:

    => SELECT component, thread_id, time FROM vertica_log limit 10;
     component | thread_id  |          time
    -----------+------------+-------------------------
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
               | 0x16321430 | 2014-04-02 04:02:02.614
               | 0x16321430 | 2014-04-02 04:02:02.614
               | 0x16321430 | 2014-04-02 04:02:02.614
    (10 rows)
    

10 - Common event format (CEF) data

Use the flex parser fcefparser to load OpenText ArcSight or other Common Event Format (CEF) log file data into columnar and flexible tables.

When you use the parser to load arbitrary CEF-format files, it interprets key names in the data as virtual columns in your flex table. After loading, you can query your CEF data directly, regardless of which set of keys exist in each row. You can also use the associated flex table data and map functions to manage CEF data access.

Create a flex table and load CEF data

This section uses a sample set of CEF data. All IP addresses have been purposely changed to be inaccurate, and Return characters added for illustration.

To use this sample data, copy the following text and remove all Return characters. Save the file as CEF_sample.cef, which is the name used throughout these examples.

CEF:0|ArcSight|ArcSight|6.0.3.6664.0|agent:030|Agent [test] type [testalertng] started|Low|
eventId=1 mrt=1396328238973 categorySignificance=/Normal categoryBehavior=/Execute/Start
categoryDeviceGroup=/Application catdt=Security Mangement categoryOutcome=/Success
categoryObject=/Host/Application/Service art=1396328241038 cat=/Agent/Started
deviceSeverity=Warning rt=1396328238937 fileType=Agent
cs2=<Resource ID\="3DxKlG0UBABCAA0cXXAZIwA\=\="/> c6a4=fe80:0:0:0:495d:cc3c:db1a:de71
cs2Label=Configuration Resource c6a4Label=Agent
IPv6 Address ahost=SKEELES10 agt=888.99.100.1 agentZoneURI=/All Zones/ArcSight
System/Private Address Space
Zones/RFC1918: 888.99.0.0-888.200.255.255 av=6.0.3.6664.0 atz=Australia/Sydney
aid=3DxKlG0UBABCAA0cXXAZIwA\=\= at=testalertng dvchost=SKEELES10 dvc=888.99.100.1
deviceZoneURI=/All Zones/ArcSight System/Private Address Space Zones/RFC1918:
888.99.0.0-888.200.255.255 dtz=Australia/Sydney _cefVer=0.1
  1. Create a flex table logs:

    => CREATE FLEX TABLE logs();
    CREATE TABLE
    
  2. Load the sample CEF file, using the flex parser fcefparser:

    => COPY logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
     Rows Loaded
    -------------
               1
    (1 row)
    
  3. Use the maptostring() function to see the contents of the logs flex table:

    => SELECT maptostring(__raw__) FROM logs;
                                      maptostring
    -------------------------------------------------------------------------------------
      {
       "_cefver" : "0.1",
       "agentzoneuri" : "/All Zones/ArcSight System/Private Address
        Space Zones/RFC1918: 888.99.0.0-888.200.255.255",
       "agt" : "888.99.100.1",
       "ahost" : "SKEELES10",
       "aid" : "3DxKlG0UBABCAA0cXXAZIwA==",
       "art" : "1396328241038",
       "at" : "testalertng",
       "atz" : "Australia/Sydney",
       "av" : "6.0.3.6664.0",
       "c6a4" : "fe80:0:0:0:495d:cc3c:db1a:de71",
       "c6a4label" : "Agent IPv6 Address",
       "cat" : "/Agent/Started",
       "catdt" : "Security Mangement",
       "categorybehavior" : "/Execute/Start",
       "categorydevicegroup" : "/Application",
       "categoryobject" : "/Host/Application/Service",
       "categoryoutcome" : "/Success",
       "categorysignificance" : "/Normal",
       "cs2" : "<Resource ID=\"3DxKlG0UBABCAA0cXXAZIwA==\"/>",
       "cs2label" : "Configuration Resource",
       "deviceproduct" : "ArcSight",
       "deviceseverity" : "Warning",
       "devicevendor" : "ArcSight",
       "deviceversion" : "6.0.3.6664.0",
       "devicezoneuri" : "/All Zones/ArcSight System/Private Address Space
        Zones/RFC1918: 888.99.0.0-888.200.255.255",
       "dtz" : "Australia/Sydney",
       "dvc" : "888.99.100.1",
       "dvchost" : "SKEELES10",
       "eventid" : "1",
       "filetype" : "Agent",
       "mrt" : "1396328238973",
       "name" : "Agent [test] type [testalertng] started",
       "rt" : "1396328238937",
       "severity" : "Low",
       "signatureid" : "agent:030",
       "version" : "0"
    }
    
    (1 row)
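
Because the parser stores each CEF key as a virtual column, you can also query keys directly or retrieve individual values from the __raw__ map with a function such as maplookup. A minimal sketch against the logs table loaded above:

=> SELECT "eventid", "severity", "dvchost" FROM logs;
=> SELECT maplookup(__raw__, 'severity') FROM logs;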
    

Create a columnar table and load CEF data

This example lets you compare the flex table for CEF data with a columnar table. You do so by creating a new table and loading the same CEF_sample.cef file used in the preceding flex table example.

  1. Create a columnar table, col_logs, defining the prefix names that are hard-coded in fcefparser:

    => CREATE TABLE col_logs(version INT,
      devicevendor VARCHAR,
      deviceproduct VARCHAR,
      deviceversion VARCHAR,
      signatureid VARCHAR,
      name VARCHAR,
      severity VARCHAR);
    CREATE TABLE
    
  2. Load the sample file into col_logs, as you did for the flex table:

    => COPY col_logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
     Rows Loaded
    -------------
               1
    (1 row)
    
  3. Query the table. The same information appears here as in the flex table output:

    => \x
    Expanded display is on.
    VMart=> SELECT * FROM col_logs;
    -[ RECORD 1 ]-+----------------------------------------
    version       | 0
    devicevendor  | ArcSight
    deviceproduct | ArcSight
    deviceversion | 6.0.3.6664.0
    signatureid   | agent:030
    name          | Agent [test] type [testalertng] started
    severity      | Low
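
Note that fcefparser populates any columnar column whose name matches a CEF key, not only the hard-coded prefix names; the CEFData3 example later on this page relies on the same behavior. A minimal sketch, using a hypothetical table col_logs2 and the same sample file:

=> CREATE TABLE col_logs2(eventid INT, filetype VARCHAR, dvchost VARCHAR);
=> COPY col_logs2 FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
=> SELECT * FROM col_logs2;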
    

Compute keys and build a flex table view

In this example, you use a flex helper function to compute keys and build a view for the logs flex table.

  1. Use the compute_flextable_keys_and_build_view function to compute keys and populate a view generated from the logs flex table:

    => SELECT compute_flextable_keys_and_build_view('logs');
                                compute_flextable_keys_and_build_view
    -------------------------------------------------------------------------------------
     Please see public.logs_keys for updated keys
    The view public.logs_view is ready for querying
    (1 row)
    
  2. Query the logs_keys table to see what the function computed from the sample CEF data:

    => SELECT * FROM logs_keys;
           key_name       | frequency | data_type_guess
    ----------------------+-----------+-----------------
     c6a4                 |         1 | varchar(60)
     c6a4label            |         1 | varchar(36)
     categoryobject       |         1 | varchar(50)
     categoryoutcome      |         1 | varchar(20)
     categorysignificance |         1 | varchar(20)
     cs2                  |         1 | varchar(84)
     cs2label             |         1 | varchar(44)
     deviceproduct        |         1 | varchar(20)
     deviceversion        |         1 | varchar(24)
     devicezoneuri        |         1 | varchar(180)
     dvchost              |         1 | varchar(20)
     version              |         1 | varchar(20)
     ahost                |         1 | varchar(20)
     art                  |         1 | varchar(26)
     at                   |         1 | varchar(22)
     cat                  |         1 | varchar(28)
     catdt                |         1 | varchar(36)
     devicevendor         |         1 | varchar(20)
     dtz                  |         1 | varchar(32)
     dvc                  |         1 | varchar(24)
     filetype             |         1 | varchar(20)
     mrt                  |         1 | varchar(26)
     _cefver              |         1 | varchar(20)
     agentzoneuri         |         1 | varchar(180)
     agt                  |         1 | varchar(24)
     aid                  |         1 | varchar(50)
     atz                  |         1 | varchar(32)
     av                   |         1 | varchar(24)
     categorybehavior     |         1 | varchar(28)
     categorydevicegroup  |         1 | varchar(24)
     deviceseverity       |         1 | varchar(20)
     eventid              |         1 | varchar(20)
     name                 |         1 | varchar(78)
     rt                   |         1 | varchar(26)
     severity             |         1 | varchar(20)
     signatureid          |         1 | varchar(20)
    (36 rows)
    
  3. Query several columns from the logs_view:

    => \x
    Expanded display is on.
    VMart=> select version, devicevendor, deviceversion, name, severity, signatureid
      from logs_view;
    -[ RECORD 1 ]-+----------------------------------------
    version       | 0
    devicevendor  | ArcSight
    deviceversion | 6.0.3.6664.0
    name          | Agent [test] type [testalertng] started
    severity      | Low
    signatureid   | agent:030
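
If you query certain keys frequently, you can go a step further than a view and promote keys from the __raw__ map to real table columns with the materialize_flextable_columns function. A minimal sketch, assuming the logs table and the keys computed above:

=> SELECT materialize_flextable_columns('logs');

Afterward, \d logs lists the materialized keys as real columns alongside __raw__.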
    

Use the fcefparser delimiter parameter

In this example, you use the fcefparser delimiter parameter to load and query events located in California, New Mexico, and Arizona.

  1. Create a new columnar table, CEFData3:

    => CREATE TABLE CEFData3(eventId INT, location VARCHAR(20));
    CREATE TABLE
    
  2. Using the delimiter=',' parameter, load some CEF data into the table:

    => COPY CEFData3 FROM stdin PARSER fcefparser(delimiter=',');
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> eventId=1,location=California
    >> eventId=2,location=New Mexico
    >> eventId=3,location=Arizona
    >> \.
    
  3. Query the table:

    
    => SELECT eventId, location FROM CEFData3;
     eventId |  location
    ---------+------------
           1 | California
           2 | New Mexico
           3 | Arizona
    (3 rows)
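
The delimiter parameter works the same way when the target is a flex table; the key-value pairs simply land in the __raw__ map instead of predefined columns. A minimal sketch, using a hypothetical flex table CEFData4 and the same input rows:

=> CREATE FLEX TABLE CEFData4();
=> COPY CEFData4 FROM stdin PARSER fcefparser(delimiter=',');
>> eventId=1,location=California
>> eventId=2,location=New Mexico
>> eventId=3,location=Arizona
>> \.
=> SELECT "eventid", "location" FROM CEFData4;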