This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Appendix: creating native binary format files

Using COPY to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix.

Using COPY to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix. All NATIVE files must contain:

The subsection Loading a NATIVE file into a table: example describes an example of loading data with the NATIVE parser.

1 - File signature

The first part of a NATIVE binary file consists of a file signature.

The first part of a NATIVE binary file consists of a file signature. The contents of the signature are fixed, and listed in the following table.

Byte Offset 0 1 2 3 4 5 6 7 8 9 10
Hex Value 4E 41 54 49 56 45 0A FF 0D 0A 00
Text Literals N A T I V E E'\n' E'\317' E'\r' E'\n' E'\000'

The signature ensures that the file has neither been corrupted by a non-8-bit file transfer, nor stripped of carriage returns, linefeeds, or null values. If the signature is intact, Vertica determines that the file has not been corrupted.

2 - Column definitions

Following the file signature, the file must define the widths of each column in the file as follows.

Following the file signature, the file must define the widths of each column in the file as follows.

Byte Offset Length (bytes) Description Comments
11 4 Header area length 32-bit integer in little-endian format that contains the length in bytes of remaining in the header, not including itself. This is the number of bytes from the end of this value to the start of the row data.
15 2 NATIVE file version 16-bit integer in little-endian format containing the version number of the NATIVE file format. The only valid value is currently 1. Future changes to the format could be assigned different version numbers to maintain backward compatibility.
17 1 Filler Always 0.
18 2 Number of columns 16-bit integer in little-endian format that contains the number of columns in each row in the file.
20+ 4 bytes for each column of data in the table Column widths Array of 32-bit integers in little-endian format that define the width of each column in the row. Variable-width columns have a value of -1 (0xFF 0xFF 0xFF 0xFF).

The width of each column is determined by the data type it contains. The following table explains the column width needed for each data type, along with the data encoding.

Data Type Length (bytes) Column Content
INTEGER 1, 2, 4, 8

8-, 16-, 32-, and 64-bit integers are supported. All multi-byte values are stored in little-endian format.

Note: All values for a column must be the width you specify here. If you set the length of an INTEGER column to be 4 bytes, then all of the values you supply for that column must be 32-bit integers.

BOOLEAN 1 0 for false, 1 for true.
FLOAT 8 Encoded in IEEE-754 format.
CHAR User-specified
  • Strings shorter than the specified length must be right-padded with spaces (E'\040').

  • Strings are not null-terminated.

  • Character encoding is UTF-8.

  • UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.

VARCHAR 4-byte integer (length) + data

The column width for a VARCHAR column is always -1 to signal that it contains variable-length data.

  • Each VARCHAR column value starts with a 32-bit integer that contains the number of bytes in the string.

  • The string must not be null-terminated.

  • Character encoding must be UTF-8.

  • Remember that UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.

DATE 8 64-bit integer in little-endian format containing the Julian day since Jan 01 2000 (J2451545)
TIME 8 64-bit integer in little-endian format containing the number of microseconds since midnight in the UTC time zone.
TIMETZ 8

64-bit value where

  • Upper 40 bits contain the number of microseconds since midnight.

  • Lower 24 bits contain time zone as the UTC offset in microseconds calculated as follows: Time zone is logically from -24hrs to +24hrs from UTC. Instead it is represented here as a number between 0hrs to 48hrs. Therefore, 24hrs should be added to the actual time zone to calculate it.

Each portion is stored in little-endian format (5 bytes followed by 3 bytes).

TIMESTAMP 8 64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00.
TIMESTAMPTZ 8 A 64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00 in the UTC timezone.
INTERVAL 8 64-bit integer in little-endian format containing the number of microseconds in the interval.
BINARY User-specified Similar to CHAR. The length should be specified in the file header in the Field Lengths entry for the field. The field in the record must contain length number of bytes. If the value is smaller than the specified length, the remainder should be filled with nulls (E'\000').
VARBINARY 4-byte integer + data Stored just like VARCHAR but data is interpreted as bytes rather than UTF-8 characters.
NUMERIC (precision, scale) (precision ¸ 19 + 1) ´ 8 rounded up

A constant-length data type. Length is determined by the precision, assuming that a 64-bit unsigned integer can store roughly 19 decimal digits. The data consists of a sequence of 64-bit integers, each stored in little-endian format, with the most significant integer first. Data in the integers is stored in base 264. 2's complement is used for negative numbers.

If there is a scale, then the numeric is stored as numeric ´ 10scale; that is, all real numbers are stored as integers, ignoring the decimal point. It is required that the scale matches that of the target column in the dataanchor table. Another option is to use FILLER columns to coerce the numeric to the scale of the target column.

3 - Row data

Following the file header is a sequence of records that contain the data for each row of data.

Following the file header is a sequence of records that contain the data for each row of data. Each record starts with a header:

Length (bytes) Description Comments
4 Row length

A 32-bit integer in little-endian format containing the length of the row's data in bytes. It includes the size of data only, not the header.

Note: The number of bytes in each row can vary not only because of variable-length data, but also because columns containing NULL values do not have any data in the row. If column 3 has a NULL value, then column 4's data immediately follows the end of column 2's data. See the next

Number of columns ¸ 8 rounded up (CEILING(NumFields / ( sizeof(uint8) * 8) ); ) Null value bit field A series of bytes whose bits indicate whether a column contains a NULL. The most significant bit of the first byte indicates whether the first column in this row contains a NULL, the next most significant bit indicates whether the next column contains a NULL, and so on. If a bit is 1 (true) then the column contains a NULL, and there is no value for the column in the data for the row.

Following the record header is the column values for the row. There is no separator characters for these values. Their location in the row of data is calculated based on where the previous column's data ended. Most data types have a fixed width, so their location is easy to determine. Variable-width values (such as VARCHAR and VARBINARY) start with a count of the number of bytes the value contains.

See the table in the previous section for details on how each data type's value is stored in the row's data.

4 - Loading a NATIVE file into a table: example

The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data.

The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data. The table contains all possible data types.

=> CREATE TABLE allTypes (INTCOL INTEGER,
                          FLOATCOL FLOAT,
                          CHARCOL CHAR(10),
                          VARCHARCOL VARCHAR,
                          BOOLCOL BOOLEAN,
                          DATECOL DATE,
                          TIMESTAMPCOL TIMESTAMP,
                          TIMESTAMPTZCOL TIMESTAMPTZ,
                          TIMECOL TIME,
                          TIMETZCOL TIMETZ,
                          VARBINCOL VARBINARY,
                          BINCOL BINARY,
                          NUMCOL NUMERIC(38,0),
                          INTERVALCOL INTERVAL
                         );
=> COPY allTypes FROM '/home/dbadmin/allTypes.bin' NATIVE;
=> \pset expanded
Expanded display is on.
=> SELECT * from allTypes;
-[ RECORD 1 ]--+------------------------
INTCOL         | 1
FLOATCOL       | -1.11
CHARCOL        | one
VARCHARCOL     | ONE
BOOLCOL        | t
DATECOL        | 1999-01-08
TIMESTAMPCOL   | 1999-02-23 03:11:52.35
TIMESTAMPTZCOL | 1999-01-08 07:04:37-05
TIMECOL        | 07:09:23
TIMETZCOL      | 15:12:34-04
VARBINCOL      | \253\315
BINCOL         | \253
NUMCOL         | 1234532
INTERVALCOL    | 03:03:03

The content of the allTypes.bin file appears below as a raw hex dump:

4E 41 54 49 56 45 0A FF 0D 0A 00 3D 00 00 00 01 00 00 0E 00
08 00 00 00 08 00 00 00 0A 00 00 00 FF FF FF FF 01 00 00 00
08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00
FF FF FF FF 03 00 00 00 18 00 00 00 08 00 00 00 73 00 00 00
00 00 01 00 00 00 00 00 00 00 C3 F5 28 5C 8F C2 F1 BF 6F 6E
65 20 20 20 20 20 20 20 03 00 00 00 4F 4E 45 01 9A FE FF FF
FF FF FF FF 30 85 B3 4F 7E E7 FF FF 40 1F 3E 64 E8 E3 FF FF
C0 2E 98 FF 05 00 00 00 D0 97 01 80 F0 79 F0 10 02 00 00 00
AB CD AB CD 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 64 D6 12 00 00 00 00 00 C0 47 A3 8E 02 00 00 00

The following table breaks this file down into each of its components, and describes the values it contains.

Hex Values Description Value
4E 41 54 49 56 45 0A FF 0D 0A 00 Signature NATIVE\n\317\r\n\000
3D 00 00 00 Header area length 61 bytes
01 00 Native file format version Version 1
00 Filler value 0
0E 00 Number of columns 14 columns
08 00 00 00 Width of column 1 (INTEGER) 8 bytes
08 00 00 00 Width of column 2 (FLOAT) 8 bytes
0A 00 00 00 Width of column 3 (CHAR(10)) 10 bytes
FF FF FF FF Width of column 4 (VARCHAR) -1 (variable width column)
01 00 00 00 Width of column 5 (BOOLEAN) 1 bytes
08 00 00 00 Width of column 6 (DATE) 8 bytes
08 00 00 00 Width of column 7 (TIMESTAMP) 8 bytes
08 00 00 00 Width of column 8 (TIMESTAMPTZ) 8 bytes
08 00 00 00 Width of column 9 (TIME) 8 bytes
08 00 00 00 Width of column 10 (TIMETZ) 8 bytes
FF FF FF FF Width of column 11 (VARBINARY) -1 (variable width column)
03 00 00 00 Width of column 12 (BINARY) 3 bytes
18 00 00 00 Width of column 13 (NUMERIC) 24 bytes. The size is calculated by dividing 38 (the precision specified for the numeric column) by 19 (the number of digits each 64-bit chunk can represent) and adding 1. 38 ¸ 19 + 1 = 3. then multiply by eight to get the number of bytes needed. 3 ´ 8 = 24 bytes.
08 00 00 00 Width of column 14 (INTERVAL). last portion of the header section. 8 bytes
73 00 00 00 Number of bytes of data for the first row. this is the start of the first row of data. 115 bytes
00 00 Bit field for the null values contained in the first row of data The row contains no null values.
01 00 00 00 00 00 00 00 Value for 64-bit INTEGER column 1
C3 F5 28 5C 8F C2 F1 BF Value for the FLOAT column -1.11
6F 6E 65 20 20 20 20 20 20 20 Value for the CHAR(10) column "one " (padded With 7 spaces to fill the full 10 characters for the column)
03 00 00 00 The number of bytes in the following VARCHAR value. 3 bytes
4F 4E 45 The value for the VARCHAR column "ONE"
01 The value for the BOOLEAN column True
9A FE FF FF FF FF FF FF The value for the DATE column 1999-01-08
30 85 B3 4F 7E E7 FF FF The value for the TIMESTAMP column 1999-02-23 03:11:52.35
40 1F 3E 64 E8 E3 FF FF The value for the TIMESTAMPTZ column 1999-01-08 07:04:37-05
C0 2E 98 FF 05 00 00 00 The value for the TIME column 07:09:23
D0 97 01 80 F0 79 F0 10 The value for the TIMETZ column 15:12:34-05
02 00 00 00 The number of bytes in the following VARBINARY value 2 bytes
AB CD The value for the VARBINARY column Binary data (\253\315 as octal values)
AB CD The value for the BINARY column Binary data (\253\315 as octal values)
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64 D6 12 00 00 00 00 00 The value for the NUMERIC column 1234532
C0 47 A3 8E 02 00 00 00 The value for the INTERVAL column 03:03:03