Use AWS Cloud Services with KNIME Analytics Platform

Overview

KNIME Analytics Platform includes a set of nodes to interact with Amazon Web Services (AWS™). They allow you to create connections to Amazon services, such as Amazon EMR, or Amazon S3.

The KNIME Amazon Cloud Connectors Extension is available on KNIME Hub.

Create an Amazon EMR cluster

This section describes a step-by-step guide on how to create an EMR cluster.

The following guide aims to create a standard EMR Spark cluster for testing and education purposes. Please modify the settings and configurations according to your needs.

The following prerequisites are necessary before launching an EMR cluster:

An Amazon AWS account. To sign up please follow the instructions provided in the AWS documentation.
An Amazon S3 bucket. The bucket is needed to exchange data between KNIME and Spark and to store the cluster log files. To create an Amazon S3 bucket, please follow the AWS documentation.

After all the prerequisites are fulfilled, you can create the EMR cluster:

In the AWS web console, go to EMR
Click the button Create cluster at the top of the page

02 create cluster button

While in the cluster creation page, navigate to the Advanced options

02 advanced options

Under Software Configuration, choose the software to be installed within the cluster. If you want to use Livy and KNIME Spark nodes, install Livy and Spark by checking the corresponding checkboxes.

02 software configuration

In Edit software settings, you can override default configurations for applications such as Spark. For example, set the Spark property maximizeResourceAllocation to true to allow executors to use maximum resources on each node in a cluster. This option works only on a pure Spark cluster (without Hive running in parallel).

02 spark property

Under Hardware Configuration, you can specify the EC2 instance types, number of EC2 instances to initialize in each node, and the purchasing option, depending on your budget. For a standard cluster, it is enough to use the default configuration. The rest of the settings you can keep by default values, or adjust them according to your needs.

For more information on the hardware and network configuration, please check the AWS documentation. For a more in-depth guidance about the optimal number of instances and other related things, please check the corresponding guidelines in the AWS documentation as well.

02 hardware configuration

Under General Options, enter the cluster name. Termination Protection is enabled by default and is important to prevent accidental termination of the cluster. To terminate the cluster, you must disable termination protection.
Under Security options, there is an option to specify the EC2 key pair. You can proceed without an EC2 key pair, but if you do have one and you want to SSH into the EMR cluster later, you can provide it here.

Further down the page, you can also specify the EC2 security group. It acts as a virtual firewall around your cluster and controls all inbound and outbound traffic of your cluster nodes. A default EMR-managed security group is created automatically for your new cluster, and you can edit the network rules in the security group after the cluster is created. Follow the instructions in the AWS documentation on how to work with EMR-managed security groups.

If needed, add your IP to the Inbound rules to enable access to the cluster.

To make some AWS services accessible from KNIME Analytics Platform, you need to enable specific ports of the EMR master node. For example, Hive is accessible via port 10000.

Click Create cluster to launch it. It might take a few minutes until all resources are available. The cluster is ready when the status shows Waiting next to the cluster name:

02 cluster ready

Connect to S3

You will need the Amazon Authentication node and Amazon S3 Connector node to create a connection to Amazon S3 from within KNIME Analytics Platform. For more details, please check out the KNIME File Handling Guide.

You can check whether a connection can be successfully established by clicking on the Test connection button in the configuration dialog of the Amazon Authentication node. A new pop-up window will appear showing the connection information in the format of s3://accessKeyId@region and whether a connection is successfully created.

After the connection to Amazon S3 is established, you can then use a variety of the KNIME file handling nodes to manage files on Amazon S3:

03 s3 example

The KNIME file handling nodes are available in the node repository under IO.

For more information on Amazon S3, please check out the AWS Documentation.

Apache Hive

This section describes how to establish a connection to Hive on EMR in KNIME Analytics Platform.

This workflow connects to Hive on EMR and creates a Hive table:

05 hive example

Register the Amazon JDBC Hive driver

To register the Amazon JDBC Hive driver in KNIME Analytics Platform:

Download the driver from the AWS website
Extract the .zip file and the desired driver version
Follow the instruction in the Database Extension Guide on how to register an external JDBC driver in KNIME.

For more information about the Amazon JDBC Hive driver, please check the AWS documentation.

Hive Connector

The Hive Connector node creates a connection via JDBC to a Hive database. The output of this node is a database connection that can be used with the standard KNIME database nodes.

05 hive connector

Inside the node configuration dialog, specify:

Database dialect and driver name. The driver name is the name given to the driver when registering the driver.
Server hostname (or IP address), the port, and a database name
Authentication mechanism. By default, the username hdfs can be used as username without a password.

For more information about advanced options inside the Connector node, see the KNIME Database Extension Guide.

HDFS

To upload or work with remote files on the EMR cluster, it is recommended to use the HDFS Connector node. Amazon EMR 5.x can use hdfs or hadoop as HDFS administrator user.

Amazon Athena

This section describes Amazon Athena and how to connect to it, as well as create an Athena table, via KNIME Analytics Platform.

Amazon Athena is a query service where users are able to run SQL queries against their data that are located on Amazon S3. In Athena, databases and tables contain basically the metadata for the underlying source data. For each dataset, a corresponding table needs to be created in Athena. The metadata contains information such as the location of the dataset in Amazon S3, and the structure of the data, e.g. column names, data types, and so on.

The KNIME Amazon Athena Connector Extension is available on KNIME Hub.

Athena only reads your data on S3, you can't add or modify it.

06 example athena

Connect to Amazon Athena

Due to license restriction you need to download the Athena JDBC driver from Amazon and register it once prior connecting to Athena. To download the driver please click here and download the latest version of the JDBC driver without the AWS SDK e.g. AthenaJDBC42-2.0.35.1001.jar. Once downloaded register the driver via the KNIME preference page with athena as database type as described in the Database Extension Guide.

To connect to Amazon Athena via KNIME Analytics Platform:

Use the Amazon Authentication node to create a connection to AWS services. In the node configuration dialog please provide the AWS access key ID and secret access key. For more information about AWS access keys, see the AWS documentation.
The Amazon Athena Connector node creates a connection to Athena through the built-in Athena JDBC driver. Please provide the following information in the node configuration dialog:
a. The hostname of the Athena server. It has the format of athena.<REGION_NAME>.amazonaws.com. For example: athena.eu-west-1.amazonaws.com.
b. Name of the S3 staging directory to store the query result. For example, s3://aws-athena-query-results-eu-west-1/.

06 athena connector node

After executing this node, a connection to Athena will be established. But before you can start querying data located in S3, you have to create the corresponding Athena table.

Create an Athena table

Creating an Athena table in KNIME Analytics Platform requires a SQL statement, where you have to build your own CREATE TABLE statement. The example below shows a CREATE TABLE statement to create a table for the Amazon CloudFront log dataset which is a part of the public example Athena dataset made available at s3://athena-examples-<YOUR-REGION>/cloudfront/plaintext/.

After building your own CREATE TABLE statement, copy the statement to the node configuration dialog of DB SQL Executor node.

sql

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\\(]+[\\(]([^\\;]+).*\\%20([^\\/]+)[\\/](.*)$"
) LOCATION 's3://athena-examples-\<YOUR-REGION\>/cloudfront/plaintext/';

Once the DB SQL Executor node is executed, the corresponding Athena table that contains metadata of the data files is created. Now you can query the files using the standard KNIME database nodes.

If you are not familiar with SQL and prefer to do it interactively, you can also create the table using the Athena web console. This way, you can even let AWS Glue Crawlers to detect the file schema (column names, column types, among other things) automatically instead of entering them manually. Follow the tutorial in the Athena documentation for a more in-depth explanation.

An example workflow to demonstrate the usage of the Athena Connector node to connect to Amazon Athena from within KNIME Analytics Platform is available on KNIME Hub.

Execute Spark jobs on an EMR cluster

This section describes how to configure and run a Spark job on an EMR cluster from within KNIME Analytics Platform. Before running a Spark job on an EMR cluster, a Spark context has to be created. To create a Spark context via Livy, use the Create Spark Context (Livy) node.

Create Spark Context (Livy) node

The Create Spark Context (Livy) node creates a Spark context via Apache Livy. The node has a remote connection port (blue) as input. The idea is that this node needs to have access to a remote file system to store temporary files between KNIME and the Spark context.

A wide array of file systems are supported, such as HDFS, webHDFS, httpFS, Amazon S3, Azure Blob Store, and Google Cloud Storage. However, please note that using, e.g HDFS is complicated on a remote cluster because the storage is located on the cluster, hence any data that is stored there will be lost as soon as the cluster is terminated.

The recommended and easy way is to use Amazon S3. Please check the previous section Connect to S3 on how to establish a connection to Amazon S3.

The other connector nodes are available under IO > Connectors inside the node repository.

Open the node configuration dialog of the Create Spark Context (Livy) node and provide the following information:

The Spark version. The version has to be the same as the one used by Livy. Otherwise the node will fail. You can find the Spark version in the cluster summary page, or in the Software configuration step during cluster creation on the Amazon EMR web console.
The Livy URL including the protocol and port e.g. http://localhost:8998. You can find the URL in the cluster summary page on the Amazon EMR web console. Then simply attach the default port 8998 to the end of the URL:

04 livy url

Select the authentication method. Usually no authentication is required, but if, for example, you have setup a Kerberos authentication on the cluster, you can also use it in KNIME. If that is the case, then you have to set up Kerberos in KNIME Analytics Platform first. Please check the KNIME Kerberos documentation for more details.
Under Spark executor resources, you can manually set memory and cores for each Spark executor. There are three possible allocation strategies: default, fixed, and dynamic.
Under Advanced tab, set the staging area for Spark jobs. For Amazon S3, a staging directory is mandatory. You can also override default Spark driver resources (memory and cores) and specify custom Spark settings.

04 create spark context node

After the Create Spark Context (Livy) node is executed, the output Spark node (grey) will contain the newly created Spark context. It allows executing Spark jobs via KNIME Spark nodes.

For a more in-depth explanation on how to read and write data between a remote file system and Spark DataFrame via KNIME Analytics Platform, please check out the KNIME Databricks documentation.

This example employs a Random Forest algorithm to train a prediction model on a dataset, executed on an EMR Spark cluster:

04 emr example

An example workflow to demonstrate the usage of Amazon EMR from within KNIME Analytics Platform is available on KNIME Hub.

Tutorials

Tutorials

Concepts

Connecting

Reading and Transforming

Writing and Modifying

Reference

Tutorials

Data Processing

Environment Management

Alternative Configurations

Concepts

Reference

API Reference

Use AWS Cloud Services with KNIME Analytics Platform

Overview

Create an Amazon EMR cluster

Connect to S3