Troubleshooting data loads
You might encounter the following issues when loading data. For issues specific to external tables, see Troubleshooting external tables.
Cannot load data because all columns are complex types
A native table must contain at least one column that is either a scalar type or a native array (one-dimensional array of scalar types). If the data you wish to load consists entirely of complex types, you must add a scalar column to the table before loading the data. You do not need to populate the scalar column.
First, create a native table with the columns in the data plus one scalar column, and then load the data into that table as shown in the following example:
-- Native table with extra scalar column:
=> CREATE TABLE customers(person ROW(name VARCHAR, card VARCHAR), tempval int);
CREATE TABLE
-- Load complex data, ignoring scalar column:
=> COPY customers(person) FROM :custdata PARQUET;
Rows Loaded
-------------
3
(1 row)
Reads from Parquet files report unexpected data-type mismatches
If a Parquet file contains a column of type STRING but the corresponding column in Vertica has a different type, such as INTEGER, you might see an unclear error message. In this case, Vertica reports the type of the column in the Parquet file as BYTE_ARRAY, as in the following example:
ERROR 7247: Datatype mismatch: column 2 in the parquet_cpp source
[/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int
This behavior is specific to Parquet files; with an ORC file, the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and instead uses BYTE_ARRAY for strings. Because the Parquet file reports its type as BYTE_ARRAY, Vertica cannot determine whether the type is actually a BYTE_ARRAY or a STRING.
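If you hit this mismatch and the file's column really does hold string data, declare the Vertica column as a string type. The following is a minimal sketch, assuming (hypothetically) that column 2 of the file from the error above contains strings:
-- Declare the string column as VARCHAR so it matches the BYTE_ARRAY data:
=> CREATE TABLE nation (nationkey BIGINT, name VARCHAR(500));
CREATE TABLE
=> COPY nation FROM '/tmp/nation.0.parquet' PARQUET;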
Error 7087: wrong number of columns
When loading ORC or Parquet data, you might see an error stating that you have the wrong number of columns:
=> CREATE TABLE nation (nationkey bigint, name varchar(500),
regionkey bigint, comment varchar(500));
CREATE TABLE
=> COPY nation from :orc_dir ORC;
ERROR 7087: Attempt to load 4 columns from an orc source
[/tmp/orc_glob/test.orc] that has 9 columns
When you load data from ORC or Parquet files, your table definition must consume all of the data in the file; otherwise, this error results. To avoid this problem, add the missing columns to your table definition, as in the sketch below.
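For example, the nine-column file from the error above needs a nine-column table. This sketch uses hypothetical names and types for the five missing columns; replace them with the file's actual schema:
=> CREATE TABLE nation (nationkey BIGINT, name VARCHAR(500),
       regionkey BIGINT, comment VARCHAR(500),
       col5 INT, col6 INT, col7 INT, col8 INT, col9 INT);
CREATE TABLE
=> COPY nation FROM :orc_dir ORC;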
For Parquet data, time zones in timestamp values are not correct
Reading timestamps from a Parquet file in Vertica might produce different values, depending on the local time zone. This issue occurs because the Parquet format does not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets timestamps read from Parquet files as values in the local time zone. The same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time such as the following:
WARNING 0: SQL TIMESTAMPTZ is more appropriate for Parquet TIMESTAMP
because values are stored in UTC
When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.
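For example, the following sketch declares the timestamp column as TIMESTAMPTZ before loading; the table and file path are hypothetical:
-- TIMESTAMPTZ preserves the UTC semantics of Parquet timestamps:
=> CREATE TABLE sales (sale_id INT, sold_at TIMESTAMPTZ);
CREATE TABLE
=> COPY sales FROM '/data/sales.parquet' PARQUET;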
Time zones can also be incorrect in ORC data, but the reason is different.
For ORC data, time zones in timestamp values are not correct
Vertica and Hive both use the Apache ORC library to interact with ORC data. The behavior of this library changed with Hive version 1.2.0, so how timestamps are represented depends on which version wrote the data.
When writing timestamps, the ORC library now records the time zone in the stripe footer. Vertica looks for this value and applies it when loading timestamps. If the file was written with an older version of the library, the time zone is missing from the file.
If the file does not contain a time zone, Vertica uses the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table.
The first time you query a new ORC data source, you should query this table to look for missing time zone information:
=> SELECT event_category, event_type, event_description, operator_name, event_details, COUNT(event_type)
AS COUNT FROM QUERY_EVENTS WHERE event_type ILIKE 'ORC_FILE_INFO'
GROUP BY event_category, event_type, event_description, operator_name, event_details
ORDER BY event_details;
 event_category |  event_type   |                 event_description                  | operator_name |                                          event_details                                           | count
----------------+---------------+----------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------+-------
 EXECUTION      | ORC_FILE_INFO | ORC file does not have writer timezone information | OrcParser     | Timestamp values in the ORC source [data/sales_stats.orc] will be computed using local timezone   |     2
(1 row)
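If you see this event and know the time zone in which the file was written, one option is to set the session time zone to match before loading, so that the local-time-zone interpretation is the correct one. This is a sketch; the zone and table names are examples:
=> SET TIME ZONE TO 'America/New_York';
SET
=> COPY sales_stats FROM 'data/sales_stats.orc' ORC;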
Time zones can also be incorrect in Parquet data, but the reason is different.
Some date and timestamp values are wrong by several days
When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days, the number of days the Gregorian reform of 1582 skipped. This difference applies to both DATE and TIMESTAMP values.
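To check whether a load is affected, you can look for values that predate the calendar change. A sketch, with hypothetical table and column names:
-- Count rows whose timestamps fall before 1583 and might differ from Hive:
=> SELECT COUNT(*) FROM sales WHERE sold_at < '1583-01-01';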