This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
Appendix: creating native binary format files
Using COPY to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix.
Using
COPY
to load data with the NATIVE parser requires that the input data files conform to the requirements described in this appendix. All NATIVE files must contain:
The subsection Loading a NATIVE file into a table: example describes an example of loading data with the NATIVE parser.
Note
You cannot mix Binary and ASCII source files in the same COPY statement.
1 - File signature
The first part of a NATIVE binary file consists of a file signature.
The first part of a NATIVE binary file consists of a file signature. The contents of the signature are fixed, and listed in the following table.
Byte Offset |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
Hex Value |
4E |
41 |
54 |
49 |
56 |
45 |
0A |
FF |
0D |
0A |
00 |
Text Literals |
N |
A |
T |
I |
V |
E |
E'\n' |
E'\317' |
E'\r' |
E'\n' |
E'\000' |
The signature ensures that the file has neither been corrupted by a non-8-bit file transfer, nor stripped of carriage returns, linefeeds, or null values. If the signature is intact, Vertica determines that the file has not been corrupted.
2 - Column definitions
Following the file signature, the file must define the widths of each column in the file as follows.
Following the file signature, the file must define the widths of each column in the file as follows.
Byte Offset |
Length (bytes) |
Description |
Comments |
11 |
4 |
Header area length |
32-bit integer in little-endian format that contains the length in bytes of remaining in the header, not including itself. This is the number of bytes from the end of this value to the start of the row data. |
15 |
2 |
NATIVE file version |
16-bit integer in little-endian format containing the version number of the NATIVE file format. The only valid value is currently 1. Future changes to the format could be assigned different version numbers to maintain backward compatibility. |
17 |
1 |
Filler |
Always 0. |
18 |
2 |
Number of columns |
16-bit integer in little-endian format that contains the number of columns in each row in the file. |
20+ |
4 bytes for each column of data in the table |
Column widths |
Array of 32-bit integers in little-endian format that define the width of each column in the row. Variable-width columns have a value of -1 (0xFF 0xFF 0xFF 0xFF). |
Note
All integers in NATIVE files are in little-endian format (least significant byte first).
The width of each column is determined by the data type it contains. The following table explains the column width needed for each data type, along with the data encoding.
Data Type |
Length (bytes) |
Column Content |
INTEGER |
1, 2, 4, 8 |
8-, 16-, 32-, and 64-bit integers are supported. All multi-byte values are stored in little-endian format.
Note: All values for a column must be the width you specify here. If you set the length of an INTEGER column to be 4 bytes, then all of the values you supply for that column must be 32-bit integers.
|
BOOLEAN |
1 |
0 for false, 1 for true. |
FLOAT |
8 |
Encoded in IEEE-754 format. |
CHAR |
User-specified |
-
Strings shorter than the specified length must be right-padded with spaces (E'\040').
-
Strings are not null-terminated.
-
Character encoding is UTF-8.
-
UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.
|
VARCHAR |
4-byte integer (length) + data |
The column width for a VARCHAR column is always -1 to signal that it contains variable-length data.
-
Each VARCHAR column value starts with a 32-bit integer that contains the number of bytes in the string.
-
The string must not be null-terminated.
-
Character encoding must be UTF-8.
-
Remember that UTF-8 strings can contain multi-byte characters. Therefore, number of characters in the string may not equal the number of bytes.
|
DATE |
8 |
64-bit integer in little-endian format containing the Julian day since Jan 01 2000 (J2451545) |
TIME |
8 |
64-bit integer in little-endian format containing the number of microseconds since midnight in the UTC time zone. |
TIMETZ |
8 |
64-bit value where
-
Upper 40 bits contain the number of microseconds since midnight.
-
Lower 24 bits contain time zone as the UTC offset in microseconds calculated as follows: Time zone is logically from -24hrs to +24hrs from UTC. Instead it is represented here as a number between 0hrs to 48hrs. Therefore, 24hrs should be added to the actual time zone to calculate it.
Each portion is stored in little-endian format (5 bytes followed by 3 bytes).
|
TIMESTAMP |
8 |
64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00. |
TIMESTAMPTZ |
8 |
A 64-bit integer in little-endian format containing the number of microseconds since Julian day: Jan 01 2000 00:00:00 in the UTC timezone. |
INTERVAL |
8 |
64-bit integer in little-endian format containing the number of microseconds in the interval. |
BINARY |
User-specified |
Similar to CHAR. The length should be specified in the file header in the Field Lengths entry for the field. The field in the record must contain length number of bytes. If the value is smaller than the specified length, the remainder should be filled with nulls (E'\000'). |
VARBINARY |
4-byte integer + data |
Stored just like VARCHAR but data is interpreted as bytes rather than UTF-8 characters. |
NUMERIC |
(precision, scale) (precision ¸ 19 + 1) ´ 8 rounded up |
A constant-length data type. Length is determined by the precision, assuming that a 64-bit unsigned integer can store roughly 19 decimal digits. The data consists of a sequence of 64-bit integers, each stored in little-endian format, with the most significant integer first. Data in the integers is stored in base 264. 2's complement is used for negative numbers.
If there is a scale, then the numeric is stored as numeric ´ 10scale; that is, all real numbers are stored as integers, ignoring the decimal point. It is required that the scale matches that of the target column in the dataanchor table. Another option is to use FILLER columns to coerce the numeric to the scale of the target column.
|
3 - Row data
Following the file header is a sequence of records that contain the data for each row of data.
Following the file header is a sequence of records that contain the data for each row of data. Each record starts with a header:
Length (bytes) |
Description |
Comments |
4 |
Row length |
A 32-bit integer in little-endian format containing the length of the row's data in bytes. It includes the size of data only, not the header.
Note: The number of bytes in each row can vary not only because of variable-length data, but also because columns containing NULL values do not have any data in the row. If column 3 has a NULL value, then column 4's data immediately follows the end of column 2's data. See the next
|
Number of columns ¸ 8 rounded up (CEILING(NumFields / ( sizeof(uint8) * 8) ); ) |
Null value bit field |
A series of bytes whose bits indicate whether a column contains a NULL. The most significant bit of the first byte indicates whether the first column in this row contains a NULL, the next most significant bit indicates whether the next column contains a NULL, and so on. If a bit is 1 (true) then the column contains a NULL, and there is no value for the column in the data for the row. |
Following the record header is the column values for the row. There is no separator characters for these values. Their location in the row of data is calculated based on where the previous column's data ended. Most data types have a fixed width, so their location is easy to determine. Variable-width values (such as VARCHAR and VARBINARY) start with a count of the number of bytes the value contains.
See the table in the previous section for details on how each data type's value is stored in the row's data.
4 - Loading a NATIVE file into a table: example
The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data.
The example below demonstrates creating a table and loading a NATIVE file that contains a single row of data. The table contains all possible data types.
=> CREATE TABLE allTypes (INTCOL INTEGER,
FLOATCOL FLOAT,
CHARCOL CHAR(10),
VARCHARCOL VARCHAR,
BOOLCOL BOOLEAN,
DATECOL DATE,
TIMESTAMPCOL TIMESTAMP,
TIMESTAMPTZCOL TIMESTAMPTZ,
TIMECOL TIME,
TIMETZCOL TIMETZ,
VARBINCOL VARBINARY,
BINCOL BINARY,
NUMCOL NUMERIC(38,0),
INTERVALCOL INTERVAL
);
=> COPY allTypes FROM '/home/dbadmin/allTypes.bin' NATIVE;
=> \pset expanded
Expanded display is on.
=> SELECT * from allTypes;
-[ RECORD 1 ]--+------------------------
INTCOL | 1
FLOATCOL | -1.11
CHARCOL | one
VARCHARCOL | ONE
BOOLCOL | t
DATECOL | 1999-01-08
TIMESTAMPCOL | 1999-02-23 03:11:52.35
TIMESTAMPTZCOL | 1999-01-08 07:04:37-05
TIMECOL | 07:09:23
TIMETZCOL | 15:12:34-04
VARBINCOL | \253\315
BINCOL | \253
NUMCOL | 1234532
INTERVALCOL | 03:03:03
The content of the allTypes.bin
file appears below as a raw hex dump:
4E 41 54 49 56 45 0A FF 0D 0A 00 3D 00 00 00 01 00 00 0E 00
08 00 00 00 08 00 00 00 0A 00 00 00 FF FF FF FF 01 00 00 00
08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00 08 00 00 00
FF FF FF FF 03 00 00 00 18 00 00 00 08 00 00 00 73 00 00 00
00 00 01 00 00 00 00 00 00 00 C3 F5 28 5C 8F C2 F1 BF 6F 6E
65 20 20 20 20 20 20 20 03 00 00 00 4F 4E 45 01 9A FE FF FF
FF FF FF FF 30 85 B3 4F 7E E7 FF FF 40 1F 3E 64 E8 E3 FF FF
C0 2E 98 FF 05 00 00 00 D0 97 01 80 F0 79 F0 10 02 00 00 00
AB CD AB CD 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 64 D6 12 00 00 00 00 00 C0 47 A3 8E 02 00 00 00
The following table breaks this file down into each of its components, and describes the values it contains.
Hex Values |
Description |
Value |
4E 41 54 49 56 45 0A FF 0D 0A 00 |
Signature |
NATIVE\n\317\r\n\000 |
3D 00 00 00 |
Header area length |
61 bytes |
01 00 |
Native file format version |
Version 1 |
00 |
Filler value |
0 |
0E 00 |
Number of columns |
14 columns |
08 00 00 00 |
Width of column 1 (INTEGER) |
8 bytes |
08 00 00 00 |
Width of column 2 (FLOAT) |
8 bytes |
0A 00 00 00 |
Width of column 3 (CHAR(10)) |
10 bytes |
FF FF FF FF |
Width of column 4 (VARCHAR) |
-1 (variable width column) |
01 00 00 00 |
Width of column 5 (BOOLEAN) |
1 bytes |
08 00 00 00 |
Width of column 6 (DATE) |
8 bytes |
08 00 00 00 |
Width of column 7 (TIMESTAMP) |
8 bytes |
08 00 00 00 |
Width of column 8 (TIMESTAMPTZ) |
8 bytes |
08 00 00 00 |
Width of column 9 (TIME) |
8 bytes |
08 00 00 00 |
Width of column 10 (TIMETZ) |
8 bytes |
FF FF FF FF |
Width of column 11 (VARBINARY) |
-1 (variable width column) |
03 00 00 00 |
Width of column 12 (BINARY) |
3 bytes |
18 00 00 00 |
Width of column 13 (NUMERIC) |
24 bytes. The size is calculated by dividing 38 (the precision specified for the numeric column) by 19 (the number of digits each 64-bit chunk can represent) and adding 1. 38 ¸ 19 + 1 = 3. then multiply by eight to get the number of bytes needed. 3 ´ 8 = 24 bytes. |
08 00 00 00 |
Width of column 14 (INTERVAL). last portion of the header section. |
8 bytes |
73 00 00 00 |
Number of bytes of data for the first row. this is the start of the first row of data. |
115 bytes |
00 00 |
Bit field for the null values contained in the first row of data |
The row contains no null values. |
01 00 00 00 00 00 00 00 |
Value for 64-bit INTEGER column |
1 |
C3 F5 28 5C 8F C2 F1 BF |
Value for the FLOAT column |
-1.11 |
6F 6E 65 20 20 20 20 20 20 20 |
Value for the CHAR(10) column |
"one " (padded With 7 spaces to fill the full 10 characters for the column) |
03 00 00 00 |
The number of bytes in the following VARCHAR value. |
3 bytes |
4F 4E 45 |
The value for the VARCHAR column |
"ONE " |
01 |
The value for the BOOLEAN column |
True |
9A FE FF FF FF FF FF FF |
The value for the DATE column |
1999-01-08 |
30 85 B3 4F 7E E7 FF FF |
The value for the TIMESTAMP column |
1999-02-23 03:11:52.35 |
40 1F 3E 64 E8 E3 FF FF |
The value for the TIMESTAMPTZ column |
1999-01-08 07:04:37-05 |
C0 2E 98 FF 05 00 00 00 |
The value for the TIME column |
07:09:23 |
D0 97 01 80 F0 79 F0 10 |
The value for the TIMETZ column |
15:12:34-05 |
02 00 00 00 |
The number of bytes in the following VARBINARY value |
2 bytes |
AB CD |
The value for the VARBINARY column |
Binary data (\253\315 as octal values) |
AB CD |
The value for the BINARY column |
Binary data (\253\315 as octal values) |
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 64 D6 12 00 00 00 00 00 |
The value for the NUMERIC column |
1234532 |
C0 47 A3 8E 02 00 00 00 |
The value for the INTERVAL column |
03:03:03 |