Flex tables

Flex tables are designed for data with an unknown or varying schema.

Flex tables are designed for data with an unknown or varying schema. While flex tables generally do not meet the performance needs of a production system, you can use them to explore data and develop a table definition. You can use flex tables with all data formats, including delimited (CSV).

A flex table has a special column, named __raw__, that contains all fields from data you load into it. Each row is independent and might have different fields in this column. You can use flex functions to explore this data and materialize individual columns in your table. When you have identified the columns that should consistently be part of your table, you can then define a regular (non-flex) table.

You can iterate by loading and exploring small amounts of data, as described in Inference. However, flex tables allow you to load larger amounts of data without producing errors and rejections, so you might choose to start with a larger sample.

Column discovery

The simplest flex table does not specify any columns. In this example, the first thousand rows are pulled from a much larger data set:

=> CREATE FLEX TABLE reviews_flex();

=> COPY reviews_flex FROM '/home/data/reviews/sample_1000.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
        1000
(1 row)

The COMPUTE_FLEXTABLE_KEYS function extrapolates field names and candidate data types from the raw data. The more data you have loaded the more work this function needs to do and the longer it takes to finish. If the function runs slowly for you, try a smaller sample size.

=> SELECT COMPUTE_FLEXTABLE_KEYS('reviews_flex');
             COMPUTE_FLEXTABLE_KEYS
-------------------------------------------------
 Please see public.reviews_flex_keys for updated keys
(1 row)

SELECT * FROM reviews_flex_keys;
  key_name   | frequency | data_type_guess
-------------+-----------+-----------------
 user_id     |      1000 | Varchar(44)
 useful      |      1000 | Integer
 text        |      1000 | Varchar(9878)
 stars       |      1000 | Numeric(5,2)
 review_id   |      1000 | Varchar(44)
 funny       |      1000 | Integer
 date        |      1000 | Timestamp
 cool        |      1000 | Integer
 business_id |      1000 | Varchar(44)
(9 rows)

In this example, all 1000 rows have all of the fields. This suggests that the larger data set is probably consistent, particularly if the sample was taken from the middle. In this case, the data was exported from another database. Had the data been more varied, you would see different numbers of occurrences in the keys table.

COMPUTE_FLEXTABLE_KEYS, unlike INFER_TABLE_DDL, proposes lengths for VARCHAR values and precision and scale for NUMERIC values. The function uses the largest values found in the data in the table. Future data could exceed these limits, so make sure you are using representative data or a large-enough sample.

If you query a flex table using SELECT *, the results include the raw column. This column is in a binary format, so querying it is not especially useful. To see a sample of the data, query the fields explicitly, as if they were columns. The following example omits the text of the review, which COMPUTE_FLEXTABLE_KEYS reported as potentially quite large (VARCHAR(9878)):

=> \x
SELECT user_id, useful, stars, review_id, funny, date, cool, business_id
FROM reviews_flex ORDER BY date DESC LIMIT 3;
-[ RECORD 1 ]-----------------------
user_id     | mY8vqRndQ_jHeI-kUZc-rw
useful      | 0
stars       | 5.0
review_id   | viMRTXxVGZa_E0J5I3fD5A
funny       | 0
date        | 2021-01-28 01:25:04
cool        | 0
business_id | LsR1rENf9596Oo07QFjtaQ
-[ RECORD 2 ]-----------------------
user_id     | vLKy-xJoyeSAVhPbG8-Pfg
useful      | 0
stars       | 5.0
review_id   | 8Bcq-HlIpdDJ773AkNAUeg
funny       | 0
date        | 2021-01-27 16:08:53
cool        | 0
business_id | ljNN92X-WRVzhkHeIgnllA
-[ RECORD 3 ]-----------------------
user_id     | yixdPMh2UidS3qOquE1Yow
useful      | 0
stars       | 5.0
review_id   | h_z6g8lOrvUgsLg-bklxVg
funny       | 0
date        | 2021-01-27 15:58:42
cool        | 0
business_id | WlTPQwhKyEDogIXQfP0uYA

This sample shows several integer fields with values of 0. These fields (useful, funny, and cool) appear to be counters indicating how many people found the review to have those qualities. To see how often they are used, which might help you decide whether to include them in your table, you can use the COUNT function:

=> SELECT COUNT(useful) FROM reviews_flex WHERE useful > 0;
 COUNT
-------
   548
(1 row)

=> SELECT COUNT(funny) FROM reviews_flex WHERE funny > 0;
 COUNT
-------
   234
(1 row)

=> SELECT COUNT(cool) FROM reviews_flex WHERE cool > 0;
 COUNT
-------
   296
(1 row)

All of the fields in this data except text are fairly small. How much of the review text you want to include depends on how you plan to use the data. If you plan to analyze the text, such as to perform sentiment analysis, you might set a larger size. If you don't plan to use the text, you might either ignore the text entirely or set a small size and allow the data to be truncated on load:

=> CREATE TABLE reviews(
    review_id VARCHAR(44), business_id VARCHAR(44), user_id VARCHAR(44),
    stars NUMERIC(5,2), date TIMESTAMP,
    useful INT, funny INT, cool INT);

=> COPY reviews FROM '/home/data/reviews/reviews_set1.json' PARSER FJSONPARSER() NO COMMIT;
WARNING 10596:  Warning in UDx call in user-defined object [FJSONParser], code:
0, message: Data source contained keys which did not match table schema
HINT:  SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type
 = 'UNMATCHED_KEY' GROUP BY key
 Rows Loaded
-------------
       10000
(1 row)

The JSON parser warns about keys found in the data that are not part of the table definition. Some other parsers require you to consume all of the fields. Consult the reference pages for specific parsers for more information about whether and how you can ignore fields.

To find out what unmatched fields appeared in the data, query the UDX_EVENTS system table:

=> SELECT key, SUM(num_instances) FROM UDX_EVENTS
WHERE event_type='UNMATCHED_KEY' AND transaction_id = CURRENT_TRANS_ID()
GROUP BY key;
 key  |  SUM
------+-------
 text | 10000
(1 row)

The UDX_EVENTS table shows that the only unmatched field is the previously-seen text field. You can safely commit the load transaction. When you have tested with enough data to feel confident, you can suppress the parser warning as explained on FJSONPARSER.

Complex types

Consider the following JSON data in a flex table:

=> CREATE FLEX TABLE products_flex();
COPY products_flex FROM '/home/data/products/sample.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           3
(1 row)

=> SELECT COMPUTE_FLEXTABLE_KEYS('products_flex');
                 COMPUTE_FLEXTABLE_KEYS
---------------------------------------------------------
 Please see public.products_flex_j_keys for updated keys
(1 row)

=> SELECT * FROM products_flex_keys;
  key_name   | frequency |   data_type_guess
-------------+-----------+---------------------
 description |         3 | Varchar(58)
 id          |         3 | Integer
 msrp        |         3 | Numeric(10,3)
 name        |         3 | Varchar(68)
 vendors     |         3 | long varbinary(298)
(5 rows)

The vendors field has a type of LONG VARBINARY, but the name of the column implies a collection. You can use the MAPTOSTRING function to sample the data:

=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;

                                                MAPTOSTRING
-------------------------------------------------------------------------------
 {
        "description": "something long goes here",
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "vendors": {
                "0.name": "Amazon",
                "0.price": "299.99",
                "1.name": "Best Buy",
                "1.price": "289.99",
                "2.name": "Bob's Hardware",
                "2.price": "309.99"
        }
}
(1 row)

By default, the JSON parser "flattens" JSON maps (key/value pairs). In this example, the value of the vendors field is presented as a set of six key/value pairs, but the key names (number.name and number.price) indicate that the data is more structured. Disabling this flattening makes the true structure of the JSON data more clear:

=> TRUNCATE TABLE products_flex;

=> COPY products_flex FROM '/home/data/products/sample.json'
PARSER FJSONPARSER();
 Rows Loaded
-------------
           3
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM products_flex LIMIT 1;

                                                 MAPTOSTRING
--------------------------------------------------------------------------------
 {
        "description": "something long goes here",
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "vendors": {
                "0": {
                        "name": "Amazon",
                        "price": "299.99"
                },
                "1": {
                        "name": "Best Buy",
                        "price": "289.99"
                },
                "2": {
                        "name": "Bob's Hardware",
                        "price": "309.99"
                }
        }
}
(1 row)

This data indicates that the vendors field holds an array of structs with two fields, name and price. Using this information and the computed keys for the other fields in the data, you can define a table with strongly-typed columns:

=> CREATE TABLE products(
    id INT, name VARCHAR(100), msrp NUMERIC(8,2), description VARCHAR(1000),
    vendors );

After loading data, you can use an array function to confirm that the collection was loaded correctly:

=> COPY products FROM '/home/data/products/sample.json'
PARSER FJSONPARSER(flatten_maps=false);
 Rows Loaded
-------------
           3
(1 row)

=> SELECT EXPLODE(vendors) OVER() FROM products;
 position |                      value
----------+-------------------------------------------------
        0 | {"name":"Amazon","price":"299.99"}
        1 | {"name":"Best Buy","price":"289.99"}
        2 | {"name":"Bob's Hardware","price":"309.99"}
        0 | {"name":"Amazon","price":"8949.99"}
        1 | {"name":"Bob's Hardware","price":"9999.99"}
        0 | {"name":"Amazon","price":"8299.99"}
        1 | {"name":"Designer Solutions","price":"8999.99"}
(7 rows)

Handling limitations of some file formats

The JSON format supports arrays, structs, and combinations of the two. Some formats are more limited. Delimited (CSV) data can represent arrays of scalar types, but not structs. Further, not all parsers that you can use with flex tables support complex types.

Consider the following delimited data:

id|msrp|name|vendors|prices
1064|329.99|4k 48in TV|[Amazon,Best Buy]|[299.99,289.99]
1183|11999.99|home theatre system|[Amazon,Bob's Hardware]|[8949.99,9999.99]
1271|8999.99|professional kitchen appliance set|[Amazon,Designer Solutions]|[829
9.99,8999.99]

You can load this data into a flex table using FDELIMITEDPARSER, but the parser misinterprets the arrays:

=> CREATE FLEX TABLE catalog_flex();

=> COPY catalog_flex FROM '/home/data/catalog/sample.csv' PARSER FDELIMITEDPARSER();
 Rows Loaded
-------------
           3
(1 row)

=> SELECT MAPTOSTRING(__raw__) FROM catalog_flex LIMIT 1;
                                                  MAPTOSTRING
--------------------------------------------------------------------------------
 {
        "id": "1064",
        "msrp": "329.99",
        "name": "4k 48in TV",
        "prices": ,
        "vendors": 
}
(1 row)

=> SELECT COMPUTE_FLEXTABLE_KEYS('catalog_flex');
                COMPUTE_FLEXTABLE_KEYS
------------------------------------------------------
 Please see public.catalog_flex_keys for updated keys
(1 row)

=> SELECT * FROM catalog_flex_keys;
 key_name | frequency | data_type_guess
----------+-----------+-----------------
 id       |         3 | Integer
 msrp     |         3 | Numeric(10,3)
 name     |         3 | Varchar(68)
 prices   |         3 | 
 vendors  |         3 | 
(5 rows)

The parser has mistaken the arrays as strings, and COMPUTE_FLEXTABLE_KEYS has mistaken an array of numbers for an INTERVAL and an array of strings as a VARCHAR. These are limitations of flex tables used with certain file formats, not of Vertica in general. If you see unusual results, try adding real columns to the flex table:

=> CREATE FLEX TABLE catalog_flex(
        vendors ARRAY[VARCHAR(50), 10],
        prices ARRAY[NUMERIC(10,2), 10] );
WARNING 5727:  Sort clause contains a Array attribute, catalog_flex.prices - data loads may be slowed significantly
HINT:  Consider using a non-long type column for sorting or adding at least one such column to the table

For data exploration you need not be concerned with warnings such as this one. When you are ready to define a production table, heed any warnings you receive.

After loading the data, you can confirm that the arrays were loaded correctly:

=> SELECT EXPLODE(vendors, prices USING PARAMETERS explode_count=2) OVER()
AS (v_idx,vendor,p_idx,price)
FROM catalog_flex;
 v_idx |       vendor       | p_idx |  price
-------+--------------------+-------+---------
     0 | Amazon             |     0 |  299.99
     0 | Amazon             |     1 |  289.99
     1 | Best Buy           |     0 |  299.99
     1 | Best Buy           |     1 |  289.99
     0 | Amazon             |     0 | 8949.99
     0 | Amazon             |     1 | 9999.99
     1 | Bob's Hardware     |     0 | 8949.99
     1 | Bob's Hardware     |     1 | 9999.99
     0 | Amazon             |     0 | 8299.99
     0 | Amazon             |     1 | 8999.99
     1 | Designer Solutions |     0 | 8299.99
     1 | Designer Solutions |     1 | 8999.99
(12 rows)

Create table definition

You can continue to use flex tables to explore data, creating real columns for complex types as you encounter them. FDELIMITEDPARSER is specific to flex tables; for native tables (created with CREATE TABLE), DELIMITED is the default parser:

=> CREATE TABLE catalog(
  id INT, msrp NUMERIC(8,2), name VARCHAR(100),
  vendors ARRAY[VARCHAR(50), 10], prices ARRAY[NUMERIC(10,2), 10] );

=> COPY catalog FROM '/home/data/catalog/sample.csv';
 Rows Loaded
-------------
           3
(1 row)

=> SELECT * FROM catalog;
  id  |   msrp   |                name                |             vendors             |        prices
------+----------+------------------------------------+---------------------------------+-----------------------
 1064 |   329.99 | 4k 48in TV                         | ["Amazon","Best Buy"]           | ["299.99","289.99"]
 1183 | 11999.99 | home theatre system                | ["Amazon","Bob's Hardware"]     | ["8949.99","9999.99"]
 1271 |  8999.99 | professional kitchen appliance set | ["Amazon","Designer Solutions"] | ["8299.99","8999.99"]
(3 rows)