Appendix: creating native binary format files

Using COPY to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix.

1: File signature
2: Column definitions
3: Row data
4: Loading a NATIVE file into a table: example

Using COPY to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix. All NATIVE files must contain:

The subsection Loading a NATIVE file into a table: example describes an example of loading data with the NATIVE parser.

Note

You cannot mix Binary and ASCII source files in the same COPY statement.

1 - File signature

The first part of a NATIVE binary file consists of a file signature.

The first part of a NATIVE binary file consists of a file signature. The contents of the signature are fixed, and listed in the following table.

Byte Offset	`0`	`1`	`2`	`3`	`4`	`5`	`6`	`7`	`8`	`9`	`10`
Hex Value	`4E`	`41`	`54`	`49`	`56`	`45`	`0A`	`FF`	`0D`	`0A`	`00`
Text Literals	`N`	`A`	`T`	`I`	`V`	`E`	`E'\n'`	`E'\317'`	`E'\r'`	`E'\n'`	`E'\000'`

The signature ensures that the file has neither been corrupted by a non-8-bit file transfer, nor stripped of carriage returns, linefeeds, or null values. If the signature is intact, Vertica determines that the file has not been corrupted.

2 - Column definitions

Following the file signature, the file must define the widths of each column in the file as follows.

Byte Offset	Length (bytes)	Description	Comments
11	4	Header area length	32-bit integer in little-endian format that contains the length in bytes of remaining in the header, not including itself. This is the number of bytes from the end of this value to the start of the row data.
15	2	NATIVE file version	16-bit integer in little-endian format containing the version number of the NATIVE file format. The only valid value is currently 1. Future changes to the format could be assigned different version numbers to maintain backward compatibility.
17	1	Filler	Always 0.
18	2	Number of columns	16-bit integer in little-endian format that contains the number of columns in each row in the file.
20+	4 bytes for each column of data in the table	Column widths	Array of 32-bit integers in little-endian format that define the width of each column in the row. Variable-width columns have a value of -1 (0xFF 0xFF 0xFF 0xFF).

Note

All integers in NATIVE files are in little-endian format (least significant byte first).

The width of each column is determined by the data type it contains. The following table explains the column width needed for each data type, along with the data encoding.

Data Type	Length (bytes)	Column Content
INTEGER	1, 2, 4, 8	8-, 16-, 32-, and 64-bit integers are supported. All multi-byte values are stored in little-endian format. Note: All values for a column must be the width you specify here. If you set the length of an INTEGER column to be 4 bytes, then all of the values you supply for that column must be 32-bit integers.
BOOLEAN	1	0 for false, 1 for true.
FLOAT	8	Encoded in IEEE-754 format.
CHAR	User-specified	Strings shorter than the specified length must be right-padded with spaces (E'\040'). Strings are not null-terminated. Character encoding is UTF-8. UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.
VARCHAR	4-byte integer (length) + data	The column width for a VARCHAR column is always -1 to signal that it contains variable-length data. Each VARCHAR column value starts with a 32-bit integer that contains the number of bytes in the string. The string must not be null-terminated. Character encoding must be UTF-8. Remember that UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.
DATE	8	64-bit integer in little-endian format containing the Julian day since Jan 01 2000 (J2451545)
TIME	8	64-bit integer in little-endian format containing the number of microseconds since midnight in the UTC time zone.
TIMETZ	8	64-bit value where Upper 40 bits contain the number of microseconds since midnight. Lower 24 bits contain time zone as the UTC offset in microseconds calculated as follows: Time zone is logically from -24hrs to +24hrs from UTC. Instead it is represented here as a number between 0hrs to 48hrs. Therefore, 24hrs should be added to the actual time zone to calculate it. Each portion is stored in little-endian format (5 bytes followed by 3 bytes).
TIMESTAMP	8	64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00.
TIMESTAMPTZ	8	A 64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00 in the UTC timezone.
INTERVAL	8	64-bit integer in little-endian format containing the number of microseconds in the interval.
BINARY	User-specified	Similar to CHAR. The length should be specified in the file header in the Field Lengths entry for the field. The field in the record must contain length number of bytes. If the value is smaller than the specified length, the remainder should be filled with nulls (E'\000').
VARBINARY	4-byte integer + data	Stored just like VARCHAR but data is interpreted as bytes rather than UTF-8 characters.
NUMERIC	(precision, scale) (precision ¸ 19 + 1) ´ 8 rounded up	A constant-length data type. Length is determined by the precision, assuming that a 64-bit unsigned integer can store roughly 19 decimal digits. The data consists of a sequence of 64-bit integers, each stored in little-endian format, with the most significant integer first. Data in the integers is stored in base 2⁶⁴. 2's complement is used for negative numbers. If there is a scale, then the numeric is stored as numeric ´ 10^scale; that is, all real numbers are stored as integers, ignoring the decimal point. It is required that the scale matches that of the target column in the dataanchor table. Another option is to use FILLER columns to coerce the numeric to the scale of the target column.

3 - Row data

Following the file header is a sequence of records that contain the data for each row of data.

Following the file header is a sequence of records that contain the data for each row of data. Each record starts with a header:

Length (bytes) Description Comments

4

Row length

Length (bytes)	Description	Comments
4	Row length	A 32-bit integer in little-endian format containing the length of the row's data in bytes. It includes the size of data only, not the header. Note: The number of bytes in each row can vary not only because of variable-length data, but also because columns containing NULL values do not have any data in the row. If column 3 has a NULL value, then column 4's data immediately follows the end of column 2's data. See the next
Number of columns ¸ 8 rounded up (`CEILING(NumFields / ( sizeof(uint8) * 8) );` )	Null value bit field	A series of bytes whose bits indicate whether a column contains a NULL. The most significant bit of the first byte indicates whether the first column in this row contains a NULL, the next most significant bit indicates whether the next column contains a NULL, and so on. If a bit is 1 (true) then the column contains a NULL, and there is no value for the column in the data for the row.

A 32-bit integer in little-endian format containing the length of the row's data in bytes. It includes the size of data only, not the header.

Note: The number of bytes in each row can vary not only because of variable-length data, but also because columns containing NULL values do not have any data in the row. If column 3 has a NULL value, then column 4's data immediately follows the end of column 2's data. See the next

Number of columns ¸ 8 rounded up (CEILING(NumFields / ( sizeof(uint8) * 8) ); ) Null value bit field A series of bytes whose bits indicate whether a column contains a NULL. The most significant bit of the first byte indicates whether the first column in this row contains a NULL, the next most significant bit indicates whether the next column contains a NULL, and so on. If a bit is 1 (true) then the column contains a NULL, and there is no value for the column in the data for the row.

Following the record header is the column values for the row. There is no separator characters for these values. Their location in the row of data is calculated based on where the previous column's data ended. Most data types have a fixed width, so their location is easy to determine. Variable-width values (such as VARCHAR and VARBINARY) start with a count of the number of bytes the value contains.

See the table in the previous section for details on how each data type's value is stored in the row's data.

4 - Loading a NATIVE file into a table: example

The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data.

The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data. The table contains all possible data types.

=> CREATE TABLE allTypes (INTCOL INTEGER,
                          FLOATCOL FLOAT,
                          CHARCOL CHAR(10),
                          VARCHARCOL VARCHAR,
                          BOOLCOL BOOLEAN,
                          DATECOL DATE,
                          TIMESTAMPCOL TIMESTAMP,
                          TIMESTAMPTZCOL TIMESTAMPTZ,
                          TIMECOL TIME,
                          TIMETZCOL TIMETZ,
                          VARBINCOL VARBINARY,
                          BINCOL BINARY,
                          NUMCOL NUMERIC(38,0),
                          INTERVALCOL INTERVAL
                         );
=> COPY allTypes FROM '/home/dbadmin/allTypes.bin' NATIVE;
=> \pset expanded
Expanded display is on.
=> SELECT * from allTypes;
-[ RECORD 1 ]--+------------------------
INTCOL         | 1
FLOATCOL       | -1.11
CHARCOL        | one
VARCHARCOL     | ONE
BOOLCOL        | t
DATECOL        | 1999-01-08
TIMESTAMPCOL   | 1999-02-23 03:11:52.35
TIMESTAMPTZCOL | 1999-01-08 07:04:37-05
TIMECOL        | 07:09:23
TIMETZCOL      | 15:12:34-04
VARBINCOL      | \253\315
BINCOL         | \253
NUMCOL         | 1234532
INTERVALCOL    | 03:03:03

The content of the allTypes.bin file appears below as a raw hex dump:

4E 41 54 49 56 45 0A FF 0D 0A 00 3D 00 00 00 01 00 00 0E 00
08 00 00 00 08 00 00 00 0A 00 00 00 FF FF FF FF 01 00 00 00
08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00
FF FF FF FF 03 00 00 00 18 00 00 00 08 00 00 00 73 00 00 00
00 00 01 00 00 00 00 00 00 00 C3 F5 28 5C 8F C2 F1 BF 6F 6E
65 20 20 20 20 20 20 20 03 00 00 00 4F 4E 45 01 9A FE FF FF
FF FF FF FF 30 85 B3 4F 7E E7 FF FF 40 1F 3E 64 E8 E3 FF FF
C0 2E 98 FF 05 00 00 00 D0 97 01 80 F0 79 F0 10 02 00 00 00
AB CD AB CD 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 64 D6 12 00 00 00 00 00 C0 47 A3 8E 02 00 00 00

The following table breaks this file down into each of its components, and describes the values it contains.

Hex Values	Description	Value
`4E 41 54 49 56 45 0A FF 0D 0A 00`	Signature	`NATIVE\n\317\r\n\000`
`3D 00 00 00`	Header area length	61 bytes
`01 00`	Native file format version	Version 1
`00`	Filler value	0
`0E 00`	Number of columns	14 columns
`08 00 00 00`	Width of column 1 (INTEGER)	8 bytes
`08 00 00 00`	Width of column 2 (FLOAT)	8 bytes
`0A 00 00 00`	Width of column 3 (CHAR(10))	10 bytes
`FF FF FF FF`	Width of column 4 (VARCHAR)	-1 (variable width column)
`01 00 00 00`	Width of column 5 (BOOLEAN)	1 bytes
`08 00 00 00`	Width of column 6 (DATE)	8 bytes
`08 00 00 00`	Width of column 7 (TIMESTAMP)	8 bytes
`08 00 00 00`	Width of column 8 (TIMESTAMPTZ)	8 bytes
`08 00 00 00`	Width of column 9 (TIME)	8 bytes
`08 00 00 00`	Width of column 10 (TIMETZ)	8 bytes
`FF FF FF FF`	Width of column 11 (VARBINARY)	-1 (variable width column)
`03 00 00 00`	Width of column 12 (BINARY)	3 bytes
`18 00 00 00`	Width of column 13 (NUMERIC)	24 bytes. The size is calculated by dividing 38 (the precision specified for the numeric column) by 19 (the number of digits each 64-bit chunk can represent) and adding 1. 38 ¸ 19 + 1 = 3. then multiply by eight to get the number of bytes needed. 3 ´ 8 = 24 bytes.
`08 00 00 00`	Width of column 14 (INTERVAL). last portion of the header section.	8 bytes
`73 00 00 00`	Number of bytes of data for the first row. this is the start of the first row of data.	115 bytes
`00 00`	Bit field for the null values contained in the first row of data	The row contains no null values.
`01 00 00 00 00 00 00 00`	Value for 64-bit INTEGER column	1
`C3 F5 28 5C 8F C2 F1 BF`	Value for the FLOAT column	-1.11
`6F 6E 65 20 20 20 20 20 20 20`	Value for the CHAR(10) column	`"one "` (padded With 7 spaces to fill the full 10 characters for the column)
`03 00 00 00`	The number of bytes in the following VARCHAR value.	3 bytes
`4F 4E 45`	The value for the VARCHAR column	"`ONE`"
`01`	The value for the BOOLEAN column	True
`9A FE FF FF FF FF FF FF`	The value for the DATE column	1999-01-08
`30 85 B3 4F 7E E7 FF FF`	The value for the TIMESTAMP column	1999-02-23 03:11:52.35
`40 1F 3E 64 E8 E3 FF FF`	The value for the TIMESTAMPTZ column	1999-01-08 07:04:37-05
`C0 2E 98 FF 05 00 00 00`	The value for the TIME column	07:09:23
`D0 97 01 80 F0 79 F0 10`	The value for the TIMETZ column	15:12:34-05
`02 00 00 00`	The number of bytes in the following VARBINARY value	2 bytes
`AB CD`	The value for the VARBINARY column	Binary data (\253\315 as octal values)
`AB CD`	The value for the BINARY column	Binary data (\253\315 as octal values)
`00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64 D6 12 00 00 00 00 00`	The value for the NUMERIC column	1234532
`C0 47 A3 8E 02 00 00 00`	The value for the INTERVAL column	03:03:03