New features since Impala 3.2

Impala and CDH versions:

CDH 6.3.2  Impala 3.2

CDP 7.1.x  Impala 3.4

Impala 3.4 will be the last release in the Impala 3 line; after that, the community will focus on Impala 4.

Impala 4 new features:


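Impala is multi-threaded at the I/O layer: each executor (impalad) uses multiple threads to read data from multiple disks and from the network. At the compute layer, however, each fragment instance still runs on a single thread. Multi-threaded computation can significantly speed up CPU-bound queries; progress can be followed in the issues under IMPALA-3902 (visible after login).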

3.Support for Hive 3 ACID tables
Impala now supports reading and writing Hive 3 insert-only transactional tables. Read support for full ACID tables is in progress; see IMPALA-9042.
4.Query retry and fault tolerance (IMPALA-9124)
5.More complete nested type support, such as allowing nested types in the SELECT list and as function arguments and return values (IMPALA-9494, IMPALA-9518, IMPALA-9514, IMPALA-9516, IMPALA-9521)
6.Fine-grained metadata loading and caching in catalogd (IMPALA-8937)
7.Kudu Bloom filter support (IMPALA-3741)
8.Iceberg support (IMPALA-9621)
9.aarch64 support (IMPALA-9376)
10.Robin Hood hash table implementation (IMPALA-9434)
11.Python 3 support in impala-shell (IMPALA-3343)
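
Item 10 in the list above mentions a Robin Hood hash table. As a rough illustration of the general technique only (this is not Impala's C++ implementation, which this post does not describe), here is a minimal Python sketch of Robin Hood linear probing. The class name `RobinHoodMap`, its methods, and the fixed capacity without resizing or deletion are all assumptions made for the example.

```python
class RobinHoodMap:
    """Toy open-addressing map with Robin Hood linear probing.

    Hypothetical illustration: fixed capacity, no resize, no delete.
    """

    def __init__(self, capacity=8):
        self.capacity = capacity
        # Each occupied slot holds (key, value, probe_distance_from_home).
        self.slots = [None] * capacity
        self.size = 0

    def _find(self, key):
        idx = hash(key) % self.capacity
        dist = 0
        while dist < self.capacity:
            resident = self.slots[idx]
            # An empty slot, or a resident closer to its home than we are
            # to ours, means the key cannot be stored further along.
            if resident is None or resident[2] < dist:
                return None
            if resident[0] == key:
                return idx
            idx = (idx + 1) % self.capacity
            dist += 1
        return None

    def put(self, key, value):
        found = self._find(key)
        if found is not None:
            # Key already present: overwrite the value, keep its distance.
            self.slots[found] = (key, value, self.slots[found][2])
            return
        if self.size >= self.capacity:
            raise RuntimeError("table full (this sketch does not resize)")
        entry = (key, value, 0)
        idx = hash(key) % self.capacity
        while True:
            resident = self.slots[idx]
            if resident is None:
                self.slots[idx] = entry
                self.size += 1
                return
            # Robin Hood rule: if the resident is closer to its home slot
            # than the incoming entry, the resident gives up the slot.
            if resident[2] < entry[2]:
                self.slots[idx], entry = entry, resident
            entry = (entry[0], entry[1], entry[2] + 1)
            idx = (idx + 1) % self.capacity

    def get(self, key, default=None):
        found = self._find(key)
        return default if found is None else self.slots[found][1]
```

The swap rule keeps probe distances evenly spread, which is what lets a lookup stop early once it meets a resident that sits closer to its home slot than the lookup has already travelled.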


Impala 3.4 new features:

1.支持 Hive Insert-Only Transactional Tables


3.DATE Data Type Supported in Avro Tables

4.Primary Key and Foreign Key Constraints

5.Enhanced External Kudu Table

6.Ranger Column Masking

7.Experimental Support for Apache Hudi
  Starting with Impala 3.4, tables in the Hudi Read Optimized format can be read. Apache Hudi is an open-source project that supports inserts, updates, and deletes

8.ORC Reads Enabled by Default

9.Support for ZSTD and DEFLATE

10.Query Profile Exported to JSON

Query Option for Disabling HBase Row Estimation: DISABLE_HBASE_NUM_ROWS_ESTIMATE
Query Option for Controlling Size of Parquet Splits on Non-block Stores: PARQUET_OBJECT_STORE_SPLIT_SIZE
  The default value of the PARQUET_OBJECT_STORE_SPLIT_SIZE query option is 256 MB.

Capacity Quota for Scratch Disks: --scratch_dirs=/dir1:5MB,/dir2

Cookie-based Authentication: the --max_cookie_lifetime_s startup flag
Server-side Spooling of Query Results: SPOOL_QUERY_RESULTS
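
To make the scratch-disk quota syntax above more concrete, the following Python sketch parses a value shaped like `/dir1:5MB,/dir2` into (path, quota) pairs. It is an illustration of the format only, not Impala's own parser: the function name `parse_scratch_dirs`, the set of accepted unit suffixes, and the reading of an entry without a suffix as "no quota" are assumptions.

```python
import re

# Assumed unit suffixes for this sketch; Impala's accepted set may differ.
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_scratch_dirs(value):
    """Return [(path, quota_bytes_or_None), ...] for a --scratch_dirs style value."""
    result = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        path, sep, quota = entry.rpartition(":")
        if not sep:
            # No ":<quota>" suffix: treat the directory as unlimited.
            result.append((entry, None))
            continue
        match = re.fullmatch(r"(\d+)\s*([A-Za-z]*)", quota)
        if match is None or match.group(2).upper() not in _UNITS:
            raise ValueError(f"unrecognized quota in {entry!r}")
        number, unit = int(match.group(1)), match.group(2).upper()
        result.append((path, number * _UNITS[unit]))
    return result


if __name__ == "__main__":
    print(parse_scratch_dirs("/dir1:5MB,/dir2"))
```

Under these assumptions, `/dir1` is capped at 5 MB of scratch space while `/dir2` has no cap.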

12.Object Ownership Support
Object ownership for tables, views, and databases is enabled by default in Impala. When you create a database, 
a table, or a view, as the owner of that object, you implicitly have the privileges on the object. 
The privileges that owners have are specified in Ranger on the special user, {OWNER}.

The {OWNER} user must be defined in Ranger for the object ownership privileges to work in Impala.

Support for SQL:2016 standard datetime patterns: the template part of the CAST(… AS … FORMAT <template>) syntax now accepts more keywords

Impala 3.3 new features:

1.Increased Compatibility with Apache Projects:
  Impala is integrated with the following components:
Apache Ranger: Use Apache Ranger to manage authorization in Impala.
Apache Atlas: Use Apache Atlas to manage data governance in Impala.
Hive 3

2.Parquet Page Index
To improve performance when using Parquet files, Impala can now write page indexes in Parquet files
and use those indexes to skip pages for faster scans.
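
The idea behind page skipping can be sketched in a few lines of Python. A page index records per-page minimum and maximum values, so a scan with a range predicate only needs to read pages whose range overlaps the predicate. The names `PageStats` and `pages_to_scan` are hypothetical, and the sketch ignores details such as null counts and non-integer types; it is not Impala's scanner code.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page min/max statistics, as a page index would record them."""
    page_id: int
    min_value: int
    max_value: int


def pages_to_scan(page_index, lo, hi):
    """Return ids of pages that may contain rows with lo <= x <= hi.

    A page is skipped when its [min, max] range cannot overlap [lo, hi].
    """
    return [
        p.page_id
        for p in page_index
        if p.max_value >= lo and p.min_value <= hi
    ]


if __name__ == "__main__":
    index = [
        PageStats(0, 1, 100),
        PageStats(1, 101, 200),
        PageStats(2, 201, 300),
    ]
    # Only page 1 can hold values between 150 and 160.
    print(pages_to_scan(index, 150, 160))
```

The benefit grows when the column is sorted or clustered, because the per-page ranges then overlap less and more pages can be skipped.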

3.Support for Kudu Integrated with Hive Metastore
In Impala 3.3 and Kudu 1.10, Kudu is integrated with Hive Metastore (HMS), and from Impala, you can create,
 update, delete, and query the tables in the Kudu services integrated with HMS.

4.Parquet files:
Zstd Compression for Parquet files
Zstandard (Zstd) is a real-time compression algorithm offering a tradeoff between speed and ratio of compression. 
Compression levels from 1 up to 22 are supported. The lower the level, the faster the speed at the cost of
 compression ratio.

Lz4 Compression for Parquet files
Lz4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.

5.Metadata Performance Improvements
The following features to improve metadata performance are enabled by default in this release:

Incremental stats are now compressed in memory in catalogd, reducing memory footprint in catalogd.

impalad coordinators fetch incremental stats from catalogd on-demand, reducing the memory footprint and the network requirements for broadcasting metadata.

Time-based and memory-based automatic invalidation of metadata to keep the size of metadata bounded and to reduce the chances of catalogd cache running out of memory.

Automatic invalidation of metadata

With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:

INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration

6.Scalable Pool Configuration in Admission Controller
To offer more dynamic and flexible resource management, Impala supports the new configuration parameters 
that scale with the number of hosts in the resource pool. You can use the parameters to control the number of 
running queries, queued queries, and maximum amount of memory allocated for Impala resource pools.

7.Query Profile
The following information was added to the Query Profile output for better monitoring and troubleshooting of query performance.

Network I/O throughput

System disk I/O throughput

8.DATE Data Type and Functions
You can use the new DATE type to describe a particular year/month/day, in the form YYYY-MM-DD.

This initial DATE type supports the TEXT, Parquet, and HBASE file formats.

The support of DATE data type includes the following features:

DATE type column as a partitioning key column
DATE literal
Implicit casting between DATE and other types: STRING and TIMESTAMP
Most of the built-in functions for TIMESTAMP now allow the DATE type arguments, as well.

9.Support Hive Insert-Only Transactional Tables
Impala added the support to create, drop, query, and insert into the insert-only type of transactional tables.

10.HiveServer2 HTTP Connection for Clients
Now client applications can connect to Impala over HTTP via HiveServer2 with the option to use the Kerberos SPNEGO and LDAP for authentication. See Impala Clients for details.

11.Default File Format Changed to Parquet
When you create a table, the default format for that table data is now Parquet.

For backward compatibility, you can use the DEFAULT_FILE_FORMAT query option to set the default file format to the previous default, text, or other formats.

12.Built-in Function to Process JSON Objects
The GET_JSON_OBJECT() function extracts a JSON object from a string based on the specified path and returns the extracted JSON object.
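
To show what path-based extraction means, here is a simplified Python model of the behaviour. It is a sketch under stated assumptions, not Impala's implementation: it handles only dotted keys and integer array indexes (for example `$.a.b[1]`), it returns the matched Python value rather than a serialized string, and it returns `None` where a missing path is assumed to yield NULL. The path syntax accepted by the real function may be broader.

```python
import json
import re

# One path step: either ".key" or "[index]". Assumed subset of the syntax.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_str, path):
    """Follow a '$'-rooted path through a JSON document; None if not found."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        node = json.loads(json_str)
    except json.JSONDecodeError:
        return None  # assumed: invalid JSON behaves like a missing value
    pos = 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported path syntax at offset {pos}")
        key, index = step.group(1), step.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            i = int(index)
            if not isinstance(node, list) or i >= len(node):
                return None
            node = node[i]
        pos = step.end()
    return node


if __name__ == "__main__":
    doc = '{"a": {"b": [10, 20]}, "c": "x"}'
    print(get_json_object(doc, "$.a.b[1]"))
    print(get_json_object(doc, "$.missing"))
```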

13.Ubuntu 18.04
This version of Impala is certified to run on Ubuntu 18.04.


Impala 3.2 new features:

1.Multi-cluster Support
Remote File Handle Cache
Impala can now cache remote HDFS file handles when the cache_remote_file_handles impalad flag is set to true.
 This feature does not apply to non-HDFS tables, such as Kudu or HBase tables, and does not apply to the 
tables that store their data on cloud services, such as S3 or ADLS. 

2.Enhancements in Resource Management and Admission Control

 .Admission Debug page is available in the Impala Daemon (impalad) web UI at /admission and provides the
following information about Impala resource pools:
  Pool configuration
  Relevant pool stats
  Queued queries in order of being queued (local to the coordinator)
  Running queries (local to this coordinator)
  Histogram of the distribution of peak memory usage by admitted queries
 .A new query option, NUM_ROWS_PRODUCED_LIMIT, was added to limit the number of rows returned from queries.
  Impala will cancel a query if the query produces more rows than the limit specified by this query option.
  The limit applies only when the results are returned to a client, e.g. for a SELECT query, but not an
  INSERT query. This query option is a guardrail against users accidentally submitting queries that
  return a large number of rows.
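
The guardrail behaviour of NUM_ROWS_PRODUCED_LIMIT can be modelled with a small Python generator. This is only a sketch of the semantics described above: the names `fetch_with_row_limit` and `RowLimitExceeded` are hypothetical, and treating a limit of 0 as "no limit" is an assumption of the example.

```python
class RowLimitExceeded(Exception):
    """Raised when a result stream produces more rows than allowed."""


def fetch_with_row_limit(rows, num_rows_produced_limit=0):
    """Yield rows to the client, cancelling once the limit is exceeded.

    A limit of 0 is treated as 'no limit' (assumption for this sketch).
    """
    produced = 0
    for row in rows:
        produced += 1
        if 0 < num_rows_produced_limit < produced:
            raise RowLimitExceeded(
                f"query produced more than {num_rows_produced_limit} rows"
            )
        yield row


if __name__ == "__main__":
    try:
        for row in fetch_with_row_limit(range(5), num_rows_produced_limit=2):
            print(row)
    except RowLimitExceeded as exc:
        print("cancelled:", exc)
```

Note that the query is cancelled rather than truncated: rows up to the limit may already have reached the client, but the client sees an error instead of a silently shortened result.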

3.Compatibility and Usability Enhancements:

3.1 Impala can now read the TIMESTAMP_MILLIS and TIMESTAMP_MICROS Parquet types. 
3.2 Impala can now read the complex types in ORC such as ARRAY, STRUCT, and MAP 
3.3 The LEVENSHTEIN string function is supported.
    The function returns the Levenshtein distance between two input strings, that is, the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other.

3.4 The IF NOT EXISTS clause is supported in the ALTER TABLE statement.
3.5  The new DEFAULT_FILE_FORMAT query option allows you to set the default table file format. This removes
 the need for the STORED AS <format> clause. Set this option if you prefer a value that is not TEXT. 
The supported formats are:

3.6 The extended or verbose EXPLAIN output includes the following new information for queries:
The text of the analyzed query that may have been rewritten to include various optimizations and implicit casts.
The implicit casts and literals shown with the actual types.
CPU resource utilization (user, system, iowait) metrics were added to the Impala profile output.
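
For readers unfamiliar with the Levenshtein distance mentioned in item 3.3, the following Python function computes it with the standard dynamic-programming recurrence. It illustrates what the distance measures; it is not the code behind Impala's LEVENSHTEIN function, and the function name here is simply chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from a
                    current[j - 1] + 1,                   # insert into a
                    previous[j - 1] + substitution_cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

For "kitten" and "sitting" the distance is 3: substitute k with s, substitute e with i, and append g.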

4.Security Enhancement:
  The REFRESH AUTHORIZATION statement was implemented for refreshing authorization data.

