This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Complex types

Tables can have columns of complex types, including nested complex types.

Tables can have columns of complex types, including nested complex types. You can use the ROW (struct), ARRAY, and SET types in native and external tables, including flex tables. Sets are limited to one-dimensional collections of scalar types. A limited MAP type is available for external tables, but you can use ARRAY and ROW to express a map instead. Selected parsers support loading data with complex types.

You can define a column for heterogeneous combinations of the ARRAY and ROW types: a struct containing array fields or an array of structs. These types can be nested up to the maximum nesting depth of 100.

Restrictions for native tables

Complex types used in native tables have some restrictions, in addition to the restrictions for individual types listed on their reference pages:

  • A native table must have at least one column that is a primitive type or a native array (one-dimensional array of a primitive type). If a flex table has real columns, it must also have at least one column satisfying this restriction.

  • Complex type columns cannot be used in ORDER BY or PARTITION BY clauses nor as FILLER columns.

  • Complex type columns cannot have constraints.

  • Complex type columns cannot use DEFAULT or SET USING.

  • Expressions returning complex types cannot be used as projection columns, and projections cannot be segmented or ordered by columns of complex types.

See CREATE TABLE and ALTER TABLE for additional restrictions.

Deriving a table definition from the data

You can use the INFER_TABLE_DDL function to inspect Parquet, ORC, JSON, or Avro data and produce a starting point for a table definition. This function returns a CREATE TABLE statement, which might require further editing. For columns where the function could not infer the data type, the function labels the type as unknown and emits a warning. For VARCHAR and VARBINARY columns, you might need to adjust the length. Always review the statement the function returns, but especially for tables with many columns, using this function can save time and effort.

Parquet, ORC, and Avro files include schema information, but JSON files do not. For JSON, the function inspects the raw data to produce one or more candidate table definitions. In the following example, two input files differ in the structure of the menu column:

=> SELECT INFER_TABLE_DDL ('/data/*.json'
    USING PARAMETERS table_name='restaurants', format='json',
max_files=3, max_candidates=3);
WARNING 0:  This generated statement contains one or more float types which might lose precision
WARNING 0:  This generated statement contains one or more varchar/varbinary types which default to length 80

                INFER_TABLE_DDL
------------------------------------------------------------------------
 Candidate matched 1 out of 2 total files:
  create table "restaurants"(
    "cuisine" varchar,
    "location_city" Array[varchar],
    "menu" Array[Row(
      "item" varchar,
      "price" float
    )],
    "name" varchar
  );
Candidate matched 1 out of 2 total files:
  create table "restaurants"(
    "cuisine" varchar,
    "location_city" Array[varchar],
    "menu" Array[Row(
      "items" Array[Row(
        "item" varchar,
        "price" numeric
      )],
      "time" varchar
    )],
    "name" varchar
  );

(1 row)

Alternative to strong typing

Though you can fully specify a column representing any combination of ROW and ARRAY types, there might be cases where you prefer a more flexible approach. If the data contains a struct with hundreds of fields, only a few of which you need, you might prefer to extract just those few at query time instead of defining all of the fields. Similarly, if the data structure is likely to change, you might prefer to defer fully specifying the complex types. You can use flexible columns as an alternative to fully specifying the structure of a complex column. This is the same approach used for flex tables, where all data is initially loaded into a single binary column and materialized from there as needed. See Flexible complex types for more information about using this approach.

1 - Structs

Columns can contain structs, which store property-value pairs.

Columns can contain structs, which store property-value pairs. For example, a struct representing an address could have strings for the street address and city/state and an integer for the postal code:

{ "street":"150 Cambridgepark Dr.",
  "city":"Cambridge MA",
  "postalcode":02140}

Struct fields can be primitive types or other complex types.

Use the ROW expression to define a struct column. In the following example, the data has columns for customer name, address, and account number, and the address is a struct in the data. The types you declare in Vertica must be compatible with the types in the data you load into them.

=> CREATE TABLE customers (
    name VARCHAR,
    address ROW(street VARCHAR, city VARCHAR, zipcode INT),
    accountID INT);

Within the ROW, you specify the fields and their data types using the same syntax as for columns. Vertica treats the ROW as a single column for purposes of queries.

Structs can contain other structs. In the following example, employees have various personal information, including an address which is itself a struct.

=> CREATE TABLE employees(
    employeeID INT,
    personal ROW(
      name VARCHAR,
      address ROW(street VARCHAR, city VARCHAR, zipcode INT),
      taxID INT),
    department VARCHAR);

Structs can contain arrays of primitive types, arrays, or structs.

=> CREATE TABLE customers(
  name VARCHAR,
  contact ROW(
    street VARCHAR,
    city VARCHAR,
    zipcode INT,
    email ARRAY[VARCHAR]
  ),
  accountid INT );

When defining an external table with Parquet or ORC data, Vertica requires the definition of the table to match the schema of the data. For example, with the data used in the previous employees example, the following definition is an error:

=> CREATE EXTERNAL TABLE employees(
    employeeID INT,
    personal ROW(
      name VARCHAR,
      address ROW(street VARCHAR, city VARCHAR),
      zipcode INT,
      taxID INT),
    department VARCHAR)
  AS COPY FROM '...' PARQUET;
ERROR 9151: Datatype mismatch [...]

The data contains an address struct with three fields (street, city, zipcode), so the external table must also use a ROW with three fields. Changing the ROW to have two fields and promoting one of the fields to the parent ROW is a mismatch. Each ROW must match and, if structs are nested in the data, the complete structure must match.

For native tables, you can specify which columns to load from the data, so you do not need to account for all of them. For the columns you load, the definition of the table must match the schema in the data file. Some parsers report fields found in that data that are not part of the table definition.

Handling nulls

If a struct exists but a field value is null, Vertica assigns NULL as its value in the ROW. A struct where all fields are null is treated as a ROW with null fields. If the struct itself is null, Vertica reads the ROW as NULL.

Queries

See Rows (structs).

Restrictions

ROW columns have several restrictions:

  • Maximum nesting depth is 100.
  • Vertica tables support up to 9800 columns and fields. The ROW itself is not counted, only its fields.
  • ROW columns cannot use any constraints (such as NOT NULL) or defaults.
  • ROW fields cannot be auto_increment or setof.
  • ROW definition must include at least one field.
  • Row is a reserved keyword within a ROW definition, but is permitted as the name of a table or column.
  • Tables containing ROW columns cannot also contain IDENTITY, default, SET USING, or named sequence columns.

2 - Arrays

Columns can contain arrays, which store ordered lists of elements of the same type.

Columns can contain arrays, which store ordered lists of elements of the same type. For example, an address column could use an array of strings to store multiple addresses that an individual might have, such as ['668 SW New Lane', '518 Main Ave', '7040 Campfire Dr'].

There are two types of arrays:

  • Native array: a one-dimensional array of a primitive type.

  • Non-native array: all other supported arrays, including arrays that contain other arrays (multi-dimensional arrays) or structs (ROWs). Non-native arrays have some usage restrictions.

Use the ARRAY type to define an array column, specifying the type of its elements (a primitive type, a ROW (struct), or an array):

=> CREATE TABLE orders
   (orderkey    INT,
    custkey     INT,
    prodkey     ARRAY[VARCHAR(10)],
    orderprices ARRAY[DECIMAL(12,2)],
    orderdate   DATE
   );

If an array is multi-dimensional, represent it as an array containing an array:

ARRAY[ARRAY[FLOAT]]

Queries

See Arrays and sets (collections).

Restrictions

  • Native arrays support only data of primitive types, for example, int, UUID, and so on.

  • Array dimensionality is enforced. A column cannot contain arrays of varying dimensions. For example, a column that contains a three-dimensional array can only contain other three-dimensional arrays; it cannot simultaneously include a one-dimensional array. However, the arrays in a column can vary in size, where one array can contain four elements while another contains ten.

  • Array bounds, if specified, are enforced for all operations that load or alter data. Unbounded arrays may have as many elements as will fit in the allotted binary size.

  • An array has a maximum binary size. If this size is not set when the array is defined, a default value is used.

  • Arrays do not support LONG types (like LONG VARBINARY or LONG VARCHAR) or user-defined types (like Geometry).

3 - Flexible complex types

When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays.

When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays. However, there are times when you might prefer not to:

  • If the data contains a struct with a very large number of fields, and in your queries you will need only a few of them, you can avoid having to enumerate the rest in the table DDL. Further, a deeply-nested set of structs could exceed the nesting limit for the table if fully specified.

  • If the data schema is still evolving, you can delay finalizing strongly-typed DDL.

  • If you anticipate the introduction of new fields in the data, you can use flexible types to discover them. A table with strong typing, on the other hand, would silently ignore those values. For an example of using flexible types to discover new fields, see Strong and Flexible Typing.

Flexible types are a way to store complex or unstructured data as a binary blob in one column, in a way that allows access to individual elements of that data. This is the same approach that Vertica uses with flex tables, which support loading unstructured or semi-structured data. In a flex table, all data from a source is loaded into a single VMap column named __raw__. From this column you can materialize other columns, such as a specific field in JSON data, or use special lookup functions in queries to read values directly out of the __raw__ column.

Vertica uses a similar approach with complex types. You can describe the types fully using the ROW and ARRAY types in your table definition, or you can instead treat a complex type as a flexible type and not fully describe it. Each complex type that you choose to treat this way becomes its own flex-style column. You are not limited to a single column containing all data as in flex tables; instead, you can treat any complex type column, no matter how deeply it nests other types, as one flex-like column.

Defining flexible columns

To use a flexible complex type, declare the column as LONG VARBINARY. You might also need to set other parameters in the parser, as described in the parser documentation.

Consider a Parquet file with a restaurants table and the following columns:

  • name: varchar

  • cuisine type: varchar

  • location (cities): array[varchar]

  • menu: array of structs, each struct having an item name and a price

This data contains two complex columns, location (an array) and menu (an array of structs). The following example defines both columns as flexible columns by using LONG VARBINARY:

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR,
        location_city LONG VARBINARY, menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(allow_long_varbinary_match_complex_type='True');

The allow_long_varbinary_match_complex_type parameter is specific to the Parquet parser. It is required if you define any column as a flexible type. Without this parameter, Vertica tries to match the LONG VARBINARY declaration in the table to a VARBINARY column in the Parquet file, finds a complex type instead, and reports a data-type mismatch.

You need not treat all complex columns as flexible types. The following definition is also valid:

=> CREATE EXTERNAL TABLE restaurants(
            name VARCHAR, cuisine VARCHAR,
            location_city ARRAY[VARCHAR,50],
            menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(allow_long_varbinary_match_complex_type='True');

For many common data formats, you can use the INFER_TABLE_DDL function to derive a table definition from a data file. This function uses strong typing for complex types in almost all cases.

Querying flexible columns

Flexible columns are stored as LONG VARBINARY, so selecting them directly produces unhelpful results. Instead, use the flex mapping functions to extract values from these columns. The MAPTOSTRING function translates the complex type to JSON, as shown in the following example:

=> SELECT name, location_city, MAPTOSTRING(menu) AS menu FROM restaurants;
       name        |       location_city        |                    menu
-------------------+---------+------------------+----------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 Bakersfield Tacos | ["Pittsburgh"]             | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

The menu column is an array of structs. Notice that the output is a set of key/value pairs, with the key being the array index. Bob's Pizzeria has two items on its menu, and each value is a struct. The first item ("0") is a struct with an "item" value of "cheese pizza" and a "price" of "$8.25".

You can use keys to access specific values. The following example selects the first menu item from each restaurant. Note that all keys are strings, even array indexes:

=> SELECT name, location_city, menu['0']['item'] AS item, menu['0']['price'] AS price FROM restaurants;
       name        |       location_city        |     item     | price
-------------------+----------------------------+--------------+-------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco  | $9.95
(2 rows)

Instead of accessing specific indexes, you can use the MAPITEMS function in a subquery to explode a flexible type, as in the following example:

=>  SELECT name, location_city, menu_items['item'], menu_items['price']
    FROM (SELECT mapitems(menu, name, location_city) OVER(PARTITION BEST)
         AS (indexes, menu_items, name, location_city)
    FROM restaurants) explode_menu;
       name        |       location_city        |  menu_items   | menu_items
-------------------+----------------------------+---------------+------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza  | $8.25
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | spinach pizza | $10.50
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco   | $9.95
 Bakersfield Tacos | ["Pittsburgh"]             | steak taco    | $10.95
(4 rows)

For a complete list of flex mapping functions, see Flex data functions.

JSON and Avro flexible types

The parsers for JSON and Avro support both flexible and strong types for complex types. When using flexible complex types or loading into a flex table, use the flatten_maps and flatten_arrays parameters to control how the parser handles complex data. These parsers ignore these parameters for strongly-typed complex types.

The following example demonstrates the use of flexible complex types. Consider a JSON file containing the following data:

{
    "name" : "Bob's pizzeria",
    "cuisine" : "Italian",
    "location_city" : ["Cambridge", "Pittsburgh"],
    "menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
              {"item" : "spinach pizza", "price" : "$10.50"}]
}
{
    "name" : "Bakersfield Tacos",
    "cuisine" : "Mexican",
    "location_city" : ["Pittsburgh"],
    "menu" : [{"item" : "veggie taco", "price" : "$9.95"},
              {"item" : "steak taco", "price" : "$10.95"}]
}

Create a table, using LONG VARBINARY for the flexible complex types, and load data specifying these parameters:

=> CREATE TABLE restaurant(name VARCHAR, cuisine VARCHAR,
        location_city LONG VARBINARY, menu LONG VARBINARY);

=> COPY restaurant FROM '/data/restaurant.json'
   PARSER FJSONPARSER(flatten_maps=false, flatten_arrays=false);

You can use Flex functions and direct access (through indexes) to return readable values:

=> SELECT MAPTOSTRING(location_city), MAPTOSTRING(menu) FROM restaurant;
                   maptostring                   |             maptostring
-------------------------------------------------+--------------------------------------------------------
 {
    "0": "Cambridge",
    "1": "Pittsburgh"
} | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 {
    "0": "Pittsburgh"
}                       | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

=> SELECT menu['0']['item'] FROM restaurant;
     menu
--------------
 cheese pizza
 veggie taco
(2 rows)

The COPY statement shown in this example sets flatten_maps to false. Without that change, the keys for the complex columns would not work as expected, because record and array keys would be "flattened" at the top level. Querying menu['0']['item'] would produce no results. Instead, query flattened values as in the following example:

=> SELECT menu['0.item'] FROM restaurant;
     menu
--------------
 veggie taco
 cheese pizza
(2 rows)

Flattening directives apply to the entire COPY statement. You cannot flatten some columns and not others, or prevent flattening values in a complex column that is itself within a flattened flex table. Because flexible complex types and strongly-typed complex types require different values for flattening, you cannot combine strong and flexible complex types in the same load operation.

4 - System tables for complex types

Information about all complex types is recorded in the COMPLEX_TYPES system table.

Information about all complex types is recorded in the COMPLEX_TYPES system table. You must have read permission for the external table that uses a type to see its entries in this system table. Complex types are not shown in the TYPES system table.

For ROW types, each row in COMPLEX_TYPES represents one field of one ROW. The field name is the name used in the table definition if present, or a generated name beginning with _field otherwise. Each row also includes the (generated) name of its containing type, a string beginning with _ct_. ("CT" stands for "complex type".)

The following example defines one external table and then shows the types in COMPLEX_TYPES:

=> CREATE EXTERNAL TABLE warehouse(
    name VARCHAR, id_map MAP<INT,VARCHAR>,
    data row(record INT, total FLOAT, description VARCHAR(100)),
    prices ARRAY[INT], comment VARCHAR(200), sales_total FLOAT, storeID INT)
  AS COPY FROM ... PARQUET;

=> SELECT type_id,type_kind,type_name,field_id,field_name,field_type_name,field_position
    FROM COMPLEX_TYPES ORDER BY type_id,field_name;

      type_id      | type_kind |       type_name       | field_id | field_name  | field_type_name | field_position
-------------------+-----------+-----------------------+----------+-------------+-----------------+----------------
 45035996274278280 | Map       | _ct_45035996274278280 |        6 | key         | int             |              0
 45035996274278280 | Map       | _ct_45035996274278280 |        9 | value       | varchar(80)     |              1
 45035996274278282 | Row       | _ct_45035996274278282 |        9 | description | varchar(80)     |              2
 45035996274278282 | Row       | _ct_45035996274278282 |        6 | record      | int             |              0
 45035996274278282 | Row       | _ct_45035996274278282 |        7 | total       | float           |              1
 45035996274278284 | Array     | _ct_45035996274278284 |        6 |             | int             |              0
(6 rows)

This table shows the fields for the two ROW types defined in the table. When a ROW contains another ROW, as is the case here with the nested address field, the field_type_name column uses the generated name of the contained ROW. The same number, minus the leading "ct", serves as the field_id.