Data exploration

If the data you need to analyze is already well-specified in a schema, such as if it was exported from another database, then you can usually proceed to defining tables and loading the data. Often, however, your initial data is less clear or is in a format that does not require a consistent schema, like JSON. Before you can define database tables for your data, you must explore that data in more detail and make decisions about how to represent it in tables.

During data exploration, you develop a schema and decide what kind of table to use in your production database. In some cases, you might choose to use external tables to describe data, and in other cases you might choose to load data into the database. Either way, over time, you might need to make changes if the schema evolves. For example, new data might add columns or change the data type of existing columns. You should expect to revisit your data schema over time. When you do, use the techniques described in the following sections.

1 - Develop a schema

Vertica provides two general approaches to developing a table definition during data exploration without resorting to inspecting data files by hand: inference and flex tables.

If your data is in Parquet, ORC, JSON, or Avro format, you can use a Vertica function to inspect a sample file and automatically generate a "first draft" of a table definition. The output indicates where the function could not make a decision, so that you can focus on those areas as you refine the definition. Using this approach, you iteratively test sample data, refine the table definition, and test more data until you arrive at a satisfactory table definition.

Alternatively, you can use flex tables, a special type of table designed for exploring data with an unknown or varying schema. You can use flex functions to explore this data and materialize individual columns in your table. When you have identified the columns that should consistently be part of your table, you can then define a regular (non-flex) table.

1.1 - Inference

If your data is in Parquet, ORC, JSON, or Avro format, you can use a Vertica function to inspect a sample file and automatically generate a "first draft" of a table definition. You can then refine this definition, such as by specifying VARCHAR lengths, NUMERIC precision, and ARRAY bounds. If the data format supports it, you can ignore columns in the data that you do not need for your table.

When you have an initial table definition, you can load more data. New data might have different properties, so it's important to adjust your table definition in light of more data. Test the table definition with enough data to feel confident before putting it into production.

You might find it easier to do initial exploration with external tables, whose definitions are more easily modified.
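
For example, the following sketch (the column sizes here are placeholders, not recommendations) defines an external table over the Parquet file used in the examples below. Because the data is not copied into the database, you can drop and recreate this table as often as needed while you refine the definition:

=> CREATE EXTERNAL TABLE products_ext(
      id INT,
      msrp NUMERIC(6,2),
      name VARCHAR(32),
      description VARCHAR(1000),
      vendors ARRAY[ROW(name VARCHAR(32), price NUMERIC(6,2)),16])
   AS COPY FROM '/data/products/products.parquet' PARQUET;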

Create first draft of table

The INFER_TABLE_DDL function inspects a data file and returns a CREATE TABLE or CREATE EXTERNAL TABLE statement based on its contents. The column definitions in the returned statement can be incomplete, and sometimes the function cannot determine a column's data type. The function also uses the column names as they appear in the data, even if they violate Vertica restrictions. For all of these reasons, you must review and adjust the proposed definition.

In the following example, the input path contains data for a product catalog. Note the warnings about data types:

=> SELECT INFER_TABLE_DDL('/data/products/products.parquet'
    USING PARAMETERS format = 'parquet', table_name = 'products');

WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80

     INFER_TABLE_DDL
------------------------------------------------------------------
 create table "products"(
  "id" int,
  "msrp" numeric(6,2),
  "name" varchar,
  "description" varchar,
  "vendors" Array[Row(
    "name" varchar,
    "price" numeric(6,2)
  )]
);
(1 row)

The table definition contains several VARCHAR values with unspecified lengths. The Vertica default length for VARCHAR is 80, and for a native table, Vertica allocates memory based on the maximum length. If a value is unlikely to be that large, such as for a name, you can specify a smaller size to improve query performance. Conversely, if a value could be much longer, such as the description field here, you can specify a larger size so that Vertica does not truncate the value.

For numeric values, INFER_TABLE_DDL suggests precision and scale based on the data in the sample file. As with VARCHAR, values that are too high cause more memory to be allocated. In addition, computations could produce unexpected results by using the extra precision instead of rounding.

Arrays, by default, are unbounded, and Vertica allocates memory to accommodate the maximum binary size. To reduce this footprint, you can specify a maximum number of elements for an array or its maximum binary size. For more information about array bounds, see Limits on Element Count and Collection Size on the ARRAY reference page.

In this example, you might make the following changes:

  • Shorten the product name and vendor name fields, for example VARCHAR(32).

  • Lengthen the description field, for example VARCHAR(1000).

  • Accept the suggested numeric precision and scale.

  • Specify a maximum number of elements for the vendors array.

With these changes, the table definition becomes:

=> CREATE TABLE products(
      id INT,
      msrp NUMERIC(6,2),
      name VARCHAR(32),
      description VARCHAR(1000),
      vendors ARRAY[ROW(name VARCHAR(32), price NUMERIC(6,2)),16]
      );

As a sanity check, start by loading the data file you used for INFER_TABLE_DDL. All rows should load successfully:

=> COPY products FROM '/data/products/products.parquet' PARQUET;
 Rows Loaded
-------------
           1
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]

Test with data

INFER_TABLE_DDL bases its recommendations on a small sample of your data. As you load more data you might find outliers. Before you use your new table in production, test it with more data.

In the following example, a data file contains values that do not satisfy the table definition:

=> COPY products FROM '/data/products/products2.parquet' PARQUET;

WARNING 9738:  Some rows were rejected by the parquet parser
 Rows Loaded
-------------
           2
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]
-[ RECORD 2 ]---------------------------------------------------------------------------------------------------------------------
id          | 1271
msrp        | 8999.99
name        | professional kitchen appliance s
description | product description goes here
vendors     | [{"name":"Amazon","price":"8299.99"},{"name":"Designer Solutions","price":"8999.99"}]

If you compare the query output to the previous query, one immediate difference is the width of the name column. The value in the second row uses the full VARCHAR(32) length and appears to be truncated.

By default, COPY truncates VARCHAR values that are too long instead of rejecting them. The warning from the COPY statement indicates that, in addition to the truncated name, another problem prevented some data from being loaded.

COPY can record rejected data in a separate table. You can also direct COPY to enforce VARCHAR lengths instead of truncating values:

=> COPY products FROM '/data/products/products2.parquet' PARQUET
ENFORCELENGTH REJECTED DATA AS TABLE products_rejected;
 Rows Loaded
-------------
           1
(1 row)

Note that only one row of data is loaded this time instead of two, because ENFORCELENGTH rejects the too-long string.

You can query the rejections table for more information:

=> \x

=> SELECT rejected_reason, rejected_data FROM products_rejected;

-[ RECORD 1 ]---+--------------------------------------------------------------------
rejected_reason | The 34-byte value is too long for type varchar(32), column 3 (name)
rejected_data   | professional kitchen appliance set
-[ RECORD 2 ]---+--------------------------------------------------------------------
rejected_reason | In column 2: Value (11999.99) exceeds range of type NUMERIC(6,2)
rejected_data   | 11999.99

The first rejection is caused by the long string. The second is caused by a price that exceeds the range of the NUMERIC(6,2) type. You can increase both using ALTER TABLE:

=> ALTER TABLE products ALTER COLUMN name SET DATA TYPE VARCHAR(50);

=> ALTER TABLE products ALTER COLUMN msrp SET DATA TYPE NUMERIC(8,2);

With these changes, the data file now loads without errors:

=> TRUNCATE TABLE products;

=> DROP TABLE products_rejected;

=> COPY products FROM '/data/products/products2.parquet' PARQUET
ENFORCELENGTH REJECTED DATA AS TABLE products_rejected;

 Rows Loaded
-------------
           3
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]
-[ RECORD 2 ]---------------------------------------------------------------------------------------------------------------------
id          | 1183
msrp        | 11999.99
name        | home theatre system
description | product description...
vendors     | [{"name":"Amazon","price":"8949.99"},{"name":"Bob's Hardware","price":"9999.99"}]
-[ RECORD 3 ]---------------------------------------------------------------------------------------------------------------------
id          | 1271
msrp        | 8999.99
name        | professional kitchen appliance set
description | product description goes here
vendors     | [{"name":"Amazon","price":"8299.99"},{"name":"Designer Solutions","price":"8999.99"}]

=> SELECT rejected_reason, rejected_data FROM products_rejected;
(No rows)

Continue to load more data using the ENFORCELENGTH and REJECTED DATA options, adjusting your table definition as needed, until the rate of new problems drops to an acceptable level. Most data sets of any size contain some anomalies that will be rejected. You can use the REJECTMAX option to set a limit on rejections per data load; if the limit is exceeded, COPY aborts the load.
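
For example, the following sketch (the file name is hypothetical) rejects over-long strings, records rejections in a table, and aborts the load if more than 100 rows are rejected:

=> COPY products FROM '/data/products/products3.parquet' PARQUET
   ENFORCELENGTH REJECTED DATA AS TABLE products_rejected REJECTMAX 100;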

Data formats

INFER_TABLE_DDL supports data in Parquet, ORC, Avro, and JSON formats. The first three formats enforce a schema on each file, but JSON data can contain any mix of fields and data types. When inspecting JSON data, INFER_TABLE_DDL can report multiple options. The iterative process of loading samples, examining the results, and adjusting the table definition is the same, but some details are different for JSON.

For data in other formats, including delimited text data, you cannot use INFER_TABLE_DDL. You can examine sample data and create your own initial table definition, and then test iteratively. Alternatively, you can use a flex table to explore the data.

1.2 - Flex tables

Flex tables are designed for data with an unknown or varying schema. While flex tables generally do not meet the performance needs of a production system, you can use them to explore data and develop a table definition. You can use flex tables with all data formats, including delimited (CSV).

A flex table has a special column, named __raw__, that contains all fields from data you load into it. Each row is independent and might have different fields in this column. You can use flex functions to explore this data and materialize individual columns in your table. When you have identified the columns that should consistently be part of your table, you can then define a regular (non-flex) table.

You can iterate by loading and exploring small amounts of data, as described in Inference. However, flex tables allow you to load larger amounts of data without producing errors and rejections, so you might choose to start with a larger sample.

Column discovery

The simplest flex table does not specify any columns. In this example, the first thousand rows are pulled from a much larger data set:

=> CREATE FLEX TABLE reviews_flex();

=> COPY reviews_flex FROM '/home/data/reviews/sample_1000.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
        1000
(1 row)

The COMPUTE_FLEXTABLE_KEYS function extrapolates field names and candidate data types from the raw data. The more data you have loaded, the more work this function must do and the longer it takes to finish. If the function runs slowly, try a smaller sample size.

=> SELECT COMPUTE_FLEXTABLE_KEYS('reviews_flex');
             COMPUTE_FLEXTABLE_KEYS
-------------------------------------------------
 Please see public.reviews_flex_keys for updated keys
(1 row)

=> SELECT * FROM reviews_flex_keys;
  key_name   | frequency | data_type_guess
-------------+-----------+-----------------
 user_id     |      1000 | Varchar(44)
 useful      |      1000 | Integer
 text        |      1000 | Varchar(9878)
 stars       |      1000 | Numeric(5,2)
 review_id   |      1000 | Varchar(44)
 funny       |      1000 | Integer
 date        |      1000 | Timestamp
 cool        |      1000 | Integer
 business_id |      1000 | Varchar(44)
(9 rows)

In this example, all 1000 rows have all of the fields. This suggests that the larger data set is probably consistent, particularly if the sample was taken from the middle. In this case, the data was exported from another database. Had the data been more varied, you would see different numbers of occurrences in the keys table.

COMPUTE_FLEXTABLE_KEYS, unlike INFER_TABLE_DDL, proposes lengths for VARCHAR values and precision and scale for NUMERIC values. The function uses the largest values found in the data in the table. Future data could exceed these limits, so make sure you are using representative data or a large-enough sample.

If you query a flex table using SELECT *, the results include the raw column. This column is in a binary format, so querying it is not especially useful. To see a sample of the data, query the fields explicitly, as if they were columns. The following example omits the text of the review, which COMPUTE_FLEXTABLE_KEYS reported as potentially quite large (VARCHAR(9878)):

=> \x
=> SELECT user_id, useful, stars, review_id, funny, date, cool, business_id
FROM reviews_flex ORDER BY date DESC LIMIT 3;
-[ RECORD 1 ]-----------------------
user_id     | mY8vqRndQ_jHeI-kUZc-rw
useful      | 0
stars       | 5.0
review_id   | viMRTXxVGZa_E0J5I3fD5A
funny       | 0
date        | 2021-01-28 01:25:04
cool        | 0
business_id | LsR1rENf9596Oo07QFjtaQ
-[ RECORD 2 ]-----------------------
user_id     | vLKy-xJoyeSAVhPbG8-Pfg
useful      | 0
stars       | 5.0
review_id   | 8Bcq-HlIpdDJ773AkNAUeg
funny       | 0
date        | 2021-01-27 16:08:53
cool        | 0
business_id | ljNN92X-WRVzhkHeIgnllA
-[ RECORD 3 ]-----------------------
user_id     | yixdPMh2UidS3qOquE1Yow
useful      | 0
stars       | 5.0
review_id   | h_z6g8lOrvUgsLg-bklxVg
funny       | 0
date        | 2021-01-27 15:58:42
cool        | 0
business_id | WlTPQwhKyEDogIXQfP0uYA

This sample shows several integer fields with values of 0. These fields (useful, funny, and cool) appear to be counters indicating how many people found the review to have those qualities. To see how often they are used, which might help you decide whether to include them in your table, you can use the COUNT function:

=> SELECT COUNT(useful) FROM reviews_flex WHERE useful > 0;
 COUNT
-------
   548
(1 row)

=> SELECT COUNT(funny) FROM reviews_flex WHERE funny > 0;
 COUNT
-------
   234
(1 row)

=> SELECT COUNT(cool) FROM reviews_flex WHERE cool > 0;
 COUNT
-------
   296
(1 row)

All of the fields in this data except text are fairly small. How much of the review text you want to include depends on how you plan to use the data. If you plan to analyze the text, such as to perform sentiment analysis, you might set a larger size. If you don't plan to use the text, you might either ignore the text entirely or set a small size and allow the data to be truncated on load:

=> CREATE TABLE reviews(
    review_id VARCHAR(44), business_id VARCHAR(44), user_id VARCHAR(44),
    stars NUMERIC(5,2), date TIMESTAMP,
    useful INT, funny INT, cool INT);

=> COPY reviews FROM '/home/data/reviews/reviews_set1.json' PARSER FJSONPARSER() NO COMMIT;
WARNING 10596:  Warning in UDx call in user-defined object [FJSONParser], code:
0, message: Data source contained keys which did not match table schema
HINT:  SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type
 = 'UNMATCHED_KEY' GROUP BY key
 Rows Loaded
-------------
       10000
(1 row)

The JSON parser warns about keys found in the data that are not part of the table definition. Some other parsers require you to consume all of the fields. Consult the reference pages for specific parsers for more information about whether and how you can ignore fields.

To find out what unmatched fields appeared in the data, query the UDX_EVENTS system table:

=> SELECT key, SUM(num_instances) FROM UDX_EVENTS
WHERE event_type='UNMATCHED_KEY' AND transaction_id = CURRENT_TRANS_ID()
GROUP BY key;
 key  |  SUM
------+-------
 text | 10000
(1 row)

The UDX_EVENTS table shows that the only unmatched field is the previously seen text field. You can safely commit the load transaction. When you have tested with enough data to feel confident, you can suppress the parser warning as described on the FJSONPARSER reference page.

Complex types

Consider the following JSON data in a flex table:

=> CREATE FLEX TABLE products_flex();
=> COPY products_flex FROM '/home/data/products/sample.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           3
(1 row)

=> SELECT COMPUTE_FLEXTABLE_KEYS('products_flex');
                 COMPUTE_FLEXTABLE_KEYS
---------------------------------------------------------
 Please see public.products_flex_keys for updated keys
(1 row)

=> SELECT * FROM products_flex_keys;
  key_name   | frequency |   data_type_guess
-------------+-----------+---------------------
 description |         3 | Varchar(58)
 id          |         3 | Integer
 msrp        |         3 | Numeric(10,3)
 name        |         3 | Varchar(68)
 vendors     |         3 | long varbinary(298)
(5 rows)

The vendors field has a type of LONG VARBINARY, but the name of the column implies a collection. You can use the MAPTOSTRING function to sample the data:

=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;

                                                MAPTOSTRING
-------------------------------------------------------------------------------
 {
        "description": "something long goes here",
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "vendors": {
                "0.name": "Amazon",
                "0.price": "299.99",
                "1.name": "Best Buy",
                "1.price": "289.99",
                "2.name": "Bob's Hardware",
                "2.price": "309.99"
        }
}
(1 row)

By default, the JSON parser "flattens" JSON maps (key/value pairs). In this example, the value of the vendors field is presented as a set of six key/value pairs, but the key names (number.name and number.price) indicate that the data is more structured. Disabling this flattening makes the true structure of the JSON data clearer:

=> TRUNCATE TABLE products_flex;

=> COPY products_flex FROM '/home/data/products/sample.json'
PARSER FJSONPARSER(flatten_maps=false);
 Rows Loaded
-------------
           3
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;

                                                 MAPTOSTRING
--------------------------------------------------------------------------------
 {
        "description": "something long goes here",
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "vendors": {
                "0": {
                        "name": "Amazon",
                        "price": "299.99"
                },
                "1": {
                        "name": "Best Buy",
                        "price": "289.99"
                },
                "2": {
                        "name": "Bob's Hardware",
                        "price": "309.99"
                }
        }
}
(1 row)

This data indicates that the vendors field holds an array of structs with two fields, name and price. Using this information and the computed keys for the other fields in the data, you can define a table with strongly-typed columns:

=> CREATE TABLE products(
    id INT, name VARCHAR(100), msrp NUMERIC(8,2), description VARCHAR(1000),
    vendors ARRAY[ROW(name VARCHAR(50), price NUMERIC(8,2)),16]);

After loading data, you can use an array function to confirm that the collection was loaded correctly:

=> COPY products FROM '/home/data/products/sample.json'
PARSER FJSONPARSER(flatten_maps=false);
 Rows Loaded
-------------
           3
(1 row)

=> SELECT EXPLODE(vendors) OVER() FROM products;
 position |                      value
----------+-------------------------------------------------
        0 | {"name":"Amazon","price":"299.99"}
        1 | {"name":"Best Buy","price":"289.99"}
        2 | {"name":"Bob's Hardware","price":"309.99"}
        0 | {"name":"Amazon","price":"8949.99"}
        1 | {"name":"Bob's Hardware","price":"9999.99"}
        0 | {"name":"Amazon","price":"8299.99"}
        1 | {"name":"Designer Solutions","price":"8999.99"}
(7 rows)

Handling limitations of some file formats

The JSON format supports arrays, structs, and combinations of the two. Some formats are more limited. Delimited (CSV) data can represent arrays of scalar types, but not structs. Further, not all parsers that you can use with flex tables support complex types.

Consider the following delimited data:

id|msrp|name|vendors|prices
1064|329.99|4k 48in TV|[Amazon,Best Buy]|[299.99,289.99]
1183|11999.99|home theatre system|[Amazon,Bob's Hardware]|[8949.99,9999.99]
1271|8999.99|professional kitchen appliance set|[Amazon,Designer Solutions]|[8299.99,8999.99]

You can load this data into a flex table using FDELIMITEDPARSER, but the parser misinterprets the arrays:

=> CREATE FLEX TABLE catalog_flex();

=> COPY catalog_flex FROM '/home/data/catalog/sample.csv' PARSER FDELIMITEDPARSER();
 Rows Loaded
-------------
           3
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM catalog_flex LIMIT 1;
                                                  MAPTOSTRING
--------------------------------------------------------------------------------
 {
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "prices": "[299.99,289.99]",
        "vendors": "[Amazon,Best Buy]"
}
(1 row)

=> SELECT COMPUTE_FLEXTABLE_KEYS('catalog_flex');
                COMPUTE_FLEXTABLE_KEYS
------------------------------------------------------
 Please see public.catalog_flex_keys for updated keys
(1 row)

=> SELECT * FROM catalog_flex_keys;
 key_name | frequency | data_type_guess
----------+-----------+-----------------
 id       |         3 | Integer
 msrp     |         3 | Numeric(10,3)
 name     |         3 | Varchar(68)
 prices   |         3 | Interval
 vendors  |         3 | Varchar(54)
(5 rows)

The parser has mistaken the arrays for strings, and COMPUTE_FLEXTABLE_KEYS has mistaken an array of numbers for an INTERVAL and an array of strings for a VARCHAR. These are limitations of flex tables used with certain file formats, not of Vertica in general. If you see unusual results like these, try adding real columns to the flex table:

=> CREATE FLEX TABLE catalog_flex(
        vendors ARRAY[VARCHAR(50), 10],
        prices ARRAY[NUMERIC(10,2), 10] );
WARNING 5727:  Sort clause contains a Array attribute, catalog_flex.prices - data loads may be slowed significantly
HINT:  Consider using a non-long type column for sorting or adding at least one such column to the table

For data exploration you need not be concerned with warnings such as this one. When you are ready to define a production table, heed any warnings you receive.

After loading the data, you can confirm that the arrays were loaded correctly:

=> SELECT EXPLODE(vendors, prices USING PARAMETERS explode_count=2) OVER()
AS (v_idx,vendor,p_idx,price)
FROM catalog_flex;
 v_idx |       vendor       | p_idx |  price
-------+--------------------+-------+---------
     0 | Amazon             |     0 |  299.99
     0 | Amazon             |     1 |  289.99
     1 | Best Buy           |     0 |  299.99
     1 | Best Buy           |     1 |  289.99
     0 | Amazon             |     0 | 8949.99
     0 | Amazon             |     1 | 9999.99
     1 | Bob's Hardware     |     0 | 8949.99
     1 | Bob's Hardware     |     1 | 9999.99
     0 | Amazon             |     0 | 8299.99
     0 | Amazon             |     1 | 8999.99
     1 | Designer Solutions |     0 | 8299.99
     1 | Designer Solutions |     1 | 8999.99
(12 rows)

Create table definition

You can continue to use flex tables to explore data, creating real columns for complex types as you encounter them. FDELIMITEDPARSER is specific to flex tables; for native tables (created with CREATE TABLE), DELIMITED is the default parser:

=> CREATE TABLE catalog(
  id INT, msrp NUMERIC(8,2), name VARCHAR(100),
  vendors ARRAY[VARCHAR(50), 10], prices ARRAY[NUMERIC(10,2), 10] );

=> COPY catalog FROM '/home/data/catalog/sample.csv';
 Rows Loaded
-------------
           3
(1 row)

=> SELECT * FROM catalog;
  id  |   msrp   |                name                |             vendors             |        prices
------+----------+------------------------------------+---------------------------------+-----------------------
 1064 |   329.99 | 4k 48in TV                         | ["Amazon","Best Buy"]           | ["299.99","289.99"]
 1183 | 11999.99 | home theatre system                | ["Amazon","Bob's Hardware"]     | ["8949.99","9999.99"]
 1271 |  8999.99 | professional kitchen appliance set | ["Amazon","Designer Solutions"] | ["8299.99","8999.99"]
(3 rows)

2 - Table types

Vertica has several types of tables. During initial exploration, your focus should be on whatever is most expedient for you; you will modify or replace the tables you build at this stage before you put them in production. As you get closer to production-ready definitions, pay more attention to the specific considerations for each table type.

Native (ROS) tables

Native tables store their data in ROS format in the database. After you have finalized table definitions, you can optimize the performance of your most important or frequent queries by creating projections.
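
For example, the following is a minimal projection sketch for the catalog table defined earlier in this section; the sort and segmentation choices are illustrative, not a recommendation:

=> CREATE PROJECTION catalog_by_id AS
   SELECT * FROM catalog
   ORDER BY id
   SEGMENTED BY HASH(id) ALL NODES;

--- populate the new projection from data already in the table:
=> SELECT REFRESH('catalog');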

If you are exploring data that is in multiple locations, using a native table for exploration can be easier than using external tables. With a native table, you can load data from many sources into the database. If multiple people are exploring the same data together, you do not need to grant additional permissions for an external file system or object store. You only need one person to have permission to read the files being loaded.

On the other hand, if you make a change to a native table definition that is not compatible with existing data, the change fails:

=> SELECT name FROM catalog;
                name
------------------------------------
 4k 48in TV
 home theatre system
 professional kitchen appliance set
(3 rows)

=> ALTER TABLE catalog ALTER COLUMN name SET DATA TYPE VARCHAR(10);
ROLLBACK 2378:  Cannot convert column "name" to type "varchar(10)"
HINT:  Verify that the data in the column conforms to the new type

To make an incompatible change, you can create a new table and then use INSERT with SELECT to transform and copy existing data to the new table. For example, you can cast existing data to new types:

=> CREATE TABLE catalog2(
  id INT, msrp NUMERIC(8,2), name VARCHAR(10),
  vendors ARRAY[VARCHAR(50), 2],
  prices ARRAY[NUMERIC(10,2), 2] );

--- cannot directly insert:
=> INSERT INTO catalog2 SELECT * FROM catalog;
ERROR 4800:  String of 19 octets is too long for type Varchar(10)

-- violating array bounds is also an error:
=> INSERT INTO catalog2
SELECT id, msrp, name::VARCHAR(10), vendors, prices FROM catalog;
ERROR 9227:  Output array isn't big enough
DETAIL:  Type limit is 2 elements, but value has 3 elements

--- cast to truncate:
=> INSERT INTO catalog2
SELECT id, msrp, name::VARCHAR(10),
       vendors::ARRAY[VARCHAR(50), 2],
       prices::ARRAY[NUMERIC(10,2), 2] FROM catalog;
 OUTPUT
--------
      3
(1 row)

=> SELECT * FROM catalog2;
  id  |   msrp   |    name    |             vendors             |        prices
------+----------+------------+---------------------------------+-----------------------
 1064 |   329.99 | 4k 48in TV | ["Amazon","Best Buy"]           | ["299.99","289.99"]
 1183 | 11999.99 | home theat | ["Amazon","Bob's Hardware"]     | ["8949.99","9999.99"]
 1271 |  8999.99 | profession | ["Amazon","Designer Solutions"] | ["8299.99","8999.99"]
(3 rows)

External tables

With an external table, Vertica reads data at query time instead of storing it in the database. External tables are a good choice when you do not need the performance of native tables, for example for older data or partitions that you query less frequently.

External tables are particularly useful when the external data is growing or changing, because you do not need to reload data yourself. However, this means you discover structural changes such as changed column types or new columns at query time rather than at load time.

You can drop an external table and recreate it to change its definition. Because the database does not store the data, replacing a table is not a destructive operation. During data exploration, you can treat external tables as lightweight and disposable.
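
A sketch of this drop-and-recreate pattern, using hypothetical table and path names; only the definition changes, and subsequent queries read the same files with the new column types:

=> DROP TABLE IF EXISTS products_ext;

=> CREATE EXTERNAL TABLE products_ext(
      id INT,
      msrp NUMERIC(8,2),
      name VARCHAR(50),
      description VARCHAR(1000),
      vendors ARRAY[ROW(name VARCHAR(50), price NUMERIC(8,2)),16])
   AS COPY FROM '/data/products/*.parquet' PARQUET;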

Because the data is read at query time, you must ensure that your users have and retain permission to read the data in its original location. Depending on where the data is stored, you might need to take additional steps to manage access, such as creating AWS IAM roles on S3.

For more information, see How external tables differ from native tables.

Flex tables

Flex tables are designed for data with an unknown or varying schema. They are a good choice for exploring data with varying fields or columns. JSON and delimited formats do not embed a schema, so data in these formats is especially prone to variation.

You can use flex tables as either native tables (CREATE FLEX TABLE) or external tables (CREATE FLEX EXTERNAL TABLE).
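
For example, a flex external table over JSON files might look like the following sketch (the path is hypothetical):

=> CREATE FLEX EXTERNAL TABLE reviews_flex_ext()
   AS COPY FROM '/home/data/reviews/*.json' PARSER FJSONPARSER();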

Flex tables trade flexibility for reduced performance and are not a good choice for production environments. You can use flex tables to explore data and then create standard (non-flex) native or external tables to deploy.

3 - Schema changes

After you develop a schema for a table and continue to load data, you might find that some new data diverges from that schema. Sometimes new fields appear, and sometimes fields change data types or stop appearing. Depending on how volatile you expect your data to be, you might decide to load what you can and ignore the rest, or you might decide to update your table definition to handle the changes.

Typically, COPY rejects data that does not match the definition and loads the rest. You can monitor these rejections and adjust your table definition, as described in Test with Data. Parsers for some formats have additional considerations:

  • ORC: if the table definition does not include all of the columns in the data, the load aborts with an error.

  • Parquet: by default, the table definition must include all of the columns in the data, as for ORC. Alternatively, you can use loose schema matching to load only the columns that are part of the table definition, ignoring others. Type mismatches that cannot be coerced are errors.

  • JSON and Avro: the parsers load only the columns that are part of the table definition. If the data contains other columns, the parsers emit warnings. Type mismatches that cannot be coerced are errors.

Parquet

If you load Parquet data that contains extra columns, by default the load fails with an error:

=> CREATE TABLE orders_summary(orderid INT, accountid INT);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;
ERROR 9135:  Attempt to load 2 columns from a parquet source [/data/orders.parquet] that has 3 columns

If you do not know what the extra columns are, you can use inference to inspect a sample file. If the new fields are important, you can use ALTER TABLE to add the new columns to the table definition and then load the data.

If the new fields are not important and you want to ignore them, you can use loose schema matching. Loose schema matching means that the parser loads columns whose names match the table definition and ignores any others. Enable this with the do_soft_schema_match_by_name parameter in the PARQUET parser:

=> COPY orders_summary FROM '/data/orders.parquet'
   PARQUET(do_soft_schema_match_by_name='true');
 Rows Loaded
-------------
           3
(1 row)

Loose schema matching allows you to ignore extra data. Missing data and mismatched data types are still errors. If loads fail with these errors, you can use inference to inspect the data and then alter your table definition.
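
A sketch of that workflow, assuming that inference reveals a hypothetical third column named ordertotal:

=> SELECT INFER_TABLE_DDL('/data/orders.parquet'
   USING PARAMETERS format = 'parquet', table_name = 'orders_summary');

--- assuming the output shows a third column, for example "ordertotal" numeric(8,2):
=> ALTER TABLE orders_summary ADD COLUMN ordertotal NUMERIC(8,2);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;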

JSON and Avro

By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.

When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample file:

=> SELECT INFER_TABLE_DDL ('/data/rest1.json'
   USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
...

=> CREATE TABLE restaurants(
       name VARCHAR(50),
       cuisine VARCHAR(25),
       location ARRAY[ROW(city VARCHAR(20), state VARCHAR(2)),50],
       hours ROW(open TIME, close TIME),
       menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100]);

=> COPY restaurants FROM '/data/rest1.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           2
(1 row)

A second data file contains some new fields. The parser loads what it can and warns about the new data:

=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER() NO COMMIT;
WARNING 10596:  Warning in UDx call in user-defined object [FJSONParser], code: 0, message:
Data source contained keys which did not match table schema
HINT:  SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type = 'UNMATCHED_KEY' GROUP BY key
 Rows Loaded
-------------
           2
(1 row)

Loads that use the NO COMMIT option, as in this example, can be rolled back. Without NO COMMIT, COPY commits the rows it loads, skipping the new fields. If you then change your table definition to handle the new fields and repeat the load, the table contains two rows for each record, one with the new fields and one without. Use NO COMMIT to experiment, and use ROLLBACK to recover.

New fields are logged in the UDX_EVENTS system table:

--- from the HINT:
=> SELECT key, SUM(num_instances) FROM v_monitor.UDX_EVENTS
   WHERE event_type = 'UNMATCHED_KEY' AND transaction_id=CURRENT_TRANS_ID()
   GROUP BY key;
          key           | SUM
------------------------+-----
 chain                  |   1
 menu.elements.calories |   7
(2 rows)

=> ROLLBACK;

The data file has two new fields, one at the top level and one in the existing menu complex-type column. To find out their data types, you can inspect the file or use INFER_TABLE_DDL:

=> SELECT INFER_TABLE_DDL ('/data/rest2.json'
   USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                               INFER_TABLE_DDL
---------------------------------------------------------------------------------------------------------------------
 Candidate matched 1/1 of total files(s):
create table "restaurants"(
  "chain" bool,
  "cuisine" varchar,
  "hours" Row(
    "close" time,
    "open" time
  ),
  "location" Array[Row(
    "city" varchar,
    "state" varchar
  )],
  "menu" Array[Row(
    "calories" int,
    "item" varchar,
    "price" numeric(precision, scale)
  )],
  "name" varchar
);

(1 row)

You can use ALTER TABLE to add the new column and alter the menu column. With these changes, the data can be loaded without errors:

=> ALTER TABLE restaurants ADD COLUMN chain BOOLEAN;

=> ALTER TABLE restaurants ALTER COLUMN menu
SET DATA TYPE ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2), calories INT),100];

--- repeat the load:
=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           2
(1 row)

ALTER TABLE affects future loads. Data already in the table has null values for the added column and field:

=> SELECT EXPLODE(name, chain, menu) OVER() FROM restaurants;
     name     | chain | position |                          value
--------------+-------+----------+---------------------------------------------------------
 Pizza House  |       |        0 | {"item":"cheese pizza","price":"7.99","calories":null}
 Pizza House  |       |        1 | {"item":"spinach pizza","price":"8.99","calories":null}
 Pizza House  |       |        2 | {"item":"garlic bread","price":"4.99","calories":null}
 Sushi World  |       |        0 | {"item":"maki platter","price":"21.95","calories":null}
 Sushi World  |       |        1 | {"item":"tuna roll","price":"4.95","calories":null}
 Greasy Spoon | f     |        0 | {"item":"scrambled eggs","price":"3.99","calories":350}
 Greasy Spoon | f     |        1 | {"item":"grilled cheese","price":"3.95","calories":500}
 Greasy Spoon | f     |        2 | {"item":"tuna melt","price":"5.95","calories":600}
 Greasy Spoon | f     |        3 | {"item":"french fries","price":"1.99","calories":350}
 Pasta World  | t     |        0 | {"item":"cheese pizza","price":"7.99","calories":1200}
 Pasta World  | t     |        1 | {"item":"spinach pizza","price":"8.99","calories":900}
 Pasta World  | t     |        2 | {"item":"garlic bread","price":"4.99","calories":250}
(12 rows)

If you are only interested in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can still check for new fields from time to time with INFER_TABLE_DDL or by loading sample data with warnings enabled and the NO COMMIT option.
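
For example, the following sketch suppresses the warning for a load (the file name is hypothetical; see the FJSONPARSER reference page for the parameter values it accepts):

=> COPY restaurants FROM '/data/rest3.json'
   PARSER FJSONPARSER(suppress_warnings='true');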