KNIME File Handling Guide


Introduction

With the move towards cloud and hybrid environments, we had to rethink and rewrite the existing file handling infrastructure in KNIME Analytics Platform to provide our users with a better experience.

With KNIME Analytics Platform release 4.3 we introduced a new file handling framework that makes it more convenient to migrate workflows between file systems or to manage several file systems within the same workflow.

This guide covers the following topics:

  • Basic concepts related to file systems

  • How different types of file systems are accessible from within KNIME Analytics Platform

  • How to read from and write to different file systems, and how to conveniently transform and adjust your data tables when importing them into your workflow

  • The new path type and how to use it within the nodes that are built on the file handling framework. This section also explains how the path type differs from the KNIME URL and how to migrate between the two frameworks.

Basic concepts about file systems

In general, a file system controls how and where data is stored, accessed and retrieved.

In KNIME Analytics Platform a file system can be seen as a forest of trees, where a folder constitutes an inner tree node, while files and empty folders are the leaves.

Working directory

A working directory is a folder used by KNIME nodes to resolve relative paths. Every file system has a working directory, whether it is configured explicitly or set implicitly.

Hidden files

Within KNIME Analytics Platform, hidden files and folders are not displayed when browsing a file system. However, they can still be referenced if their path is known. Hidden files currently only exist for the local file system in KNIME Analytics Platform:

  • On Linux and macOS, their filename starts with a dot "."

  • On Windows, they are treated as regular files and folders

Path syntax

A path is a string that identifies the position of a file or folder within a file system. The path syntax depends on the file system: a path on a Windows local file system might look like C:\Users\username\file.txt, while a path on Linux and most other file systems in KNIME Analytics Platform might look like /folder1/folder2/file.txt.

Paths can be divided into:

  • Absolute: An absolute path uniquely identifies a file or folder. It always starts with a file system root.

  • Relative: A relative path does not identify one particular file or folder on its own. It identifies a file or folder relative to an absolute path, typically the working directory, as the sketch below illustrates.
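
A minimal sketch using Python's pathlib, assuming a UNIX-like file system, of how a relative path is resolved against a working directory:

    from pathlib import PurePosixPath

    working_dir = PurePosixPath("/data/projects")  # absolute: starts with the root "/"
    relative = PurePosixPath("folder1/file.txt")   # relative: no leading root

    # Resolving the relative path against the working directory yields
    # an absolute path that uniquely identifies the file.
    absolute = working_dir / relative
    print(absolute)  # /data/projects/folder1/file.txt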

KNIME Analytics Platform and file systems

Different file systems are available for use with KNIME Analytics Platform. Reader and writer nodes are able to work with all supported file systems.

File systems within KNIME Analytics Platform can be divided into two main categories:

Standard file systems

Standard file systems are available at any time, meaning they do not require a connector node.

Their working directory is pre-configured and does not need to be explicitly specified.

To read a file from a standard file system, drag and drop the reader node for the file type you want to read, e.g. the CSV Reader node for a .csv file, from the node repository into the workflow editor.

Right-click the node and choose Configure… from the context menu. In the Input location pane under the Settings tab you can choose the file system you want to read from in a drop-down menu.

[Image: CSV Reader configuration dialog with standard file system options]

The following standard file systems are available in KNIME Analytics Platform:

Local file system

When reading from the local file system, the path syntax to use depends on the system on which the workflow is executing, i.e. whether it is a Windows or a UNIX-like operating system.

The working directory is implicit and corresponds to the system root directory.

[Image: CSV Reader configuration dialog with the local file system selected]

Via the local file system within KNIME Analytics Platform, you can also access network shares supported by your operating system.

Please be aware that when a workflow is executed on KNIME Server version 4.11 or higher, local file system access is disabled for security reasons. KNIME Server administrators can activate it, but this is not recommended. For more information, please refer to the KNIME Server Administration Guide.

Mountpoint

With the Mountpoint option you will have access to KNIME mountpoints, such as LOCAL, your KNIME Server mountpoints, if any, and the KNIME Hub. You have to be logged in to the specific mountpoint to have access to it.

The path syntax will be UNIX-like, i.e. /folder1/folder2/file.txt and relative to the implicit working directory, which corresponds to the root of the mountpoint.

[Image: CSV Reader configuration dialog with the Mountpoint file system selected]

Please note that workflows inside the mountpoints are treated as files, so it is not possible to read or write files inside a workflow.

Relative to

With the Relative to option you will have access to three different file systems:

  • Current mountpoint and Current workflow: The file system corresponds to the mountpoint where the currently executing workflow is located. The working directory is implicit:

    • Current mountpoint: The working directory corresponds to the root of the mountpoint

    • Current workflow: The working directory corresponds to the path of the workflow in the mountpoint, e.g. /workflow_group/my_workflow.

  • Current workflow data area: This file system is dedicated to and accessible by the currently executing workflow. Data are physically stored inside the workflow and are copied, moved or deleted together with the workflow.

All the paths used with the Relative to option are of the form folder/file and must be relative paths.

[Image: CSV Reader configuration dialog with Relative to > Current workflow data area selected]

In the example above you will read a .csv file from a folder named data, which is located in:

<knime-workspace>/workflow_group/my_workflow/data/

Please note that workflows are treated as files, so it is not possible to read or write files inside a workflow.

When the workflow is executed on KNIME Server, the options Relative to > Current mountpoint and Relative to > Current workflow access the workflow repository on the server. The option Relative to > Current workflow data area, instead, accesses the data area of the job copy of the workflow. Please be aware that files written to the data area will be lost if the job is deleted.

Custom/KNIME URL

This option works with a pseudo-file system that allows single files to be accessed via URL. It supports the following URLs:

  • knime://

  • http(s):// if authentication is not needed

  • ssh:// if authentication is not needed

  • ftp:// if authentication is not needed

For this option you can also manually set a timeout parameter (in milliseconds) for reading and writing.

The URL syntax should be as follows:

  • scheme:[//authority]path[?query][#fragment]

  • The URL must be encoded, e.g. spaces and reserved special characters such as ?. To encode a URL you can use any available URL encoder tool, or a small script like the sketch below.
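
A minimal Python sketch of percent-encoding a hypothetical path for use in such a URL:

    from urllib.parse import quote

    raw_path = "/folder 1/file?.txt"
    # Percent-encode the space and the reserved "?" while keeping
    # the "/" path separators intact.
    encoded = quote(raw_path, safe="/")
    print(encoded)  # /folder%201/file%3F.txt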

Using this option you can read and write single files, but you cannot move or copy files or folders. Listing the files in a folder, i.e. browsing, is also not supported.

Connected file systems

Connected file systems, instead, require a connector node to connect to the specific file system. In the connector node's configuration dialog you can configure the most convenient working directory.

The file system Connector nodes that are available in KNIME Analytics Platform can be divided into two main categories:

Connector nodes that need an Authentication node:

  • Amazon S3 Connector node

  • Google Cloud Storage Connector node

  • Google Drive Connector node

  • SharePoint Online Connector node

  • Azure Blob Storage Connector node

Connector nodes that do not need an Authentication node:

  • Databricks File System Connector node

  • HDFS Connector node

  • HDFS Connector (KNOX) node

  • Create Local Big Data Environment node

  • SSH Connector node

  • HTTP(S) Connector node

  • FTP Connector node

  • KNIME Server Connector node

File systems with external Authentication

The path syntax varies according to the connected file system, but in most cases it is UNIX-like. Information on this is given in the respective Connector node descriptions.

Typically in the configuration dialog of the Connector node you will be able to:

  • Set up the working directory: In the Settings tab, type the path of the working directory or browse through the file system to select one.

  • Set up the timeouts: In the Advanced tab, set the connection timeout (in seconds) and the read timeout (in seconds).

[Image: Connector node configuration dialog]

Most connectors require a network connection to the respective remote service. The connection is opened when the Connector node executes and closed when the Connector node is reset or the workflow is closed.

It is important to note that connections are not automatically re-established when loading an already executed workflow. To reconnect to the remote service you will need to execute the Connector node again.

[Image: warning that credentials are not available]

Amazon file system

To connect to the Amazon S3 file system you will need to use:

  • Amazon Authentication node

  • Amazon S3 Connector node

[Image: Amazon Authentication and Amazon S3 Connector nodes]

The Amazon S3 file system normalizes paths. However, Amazon S3 allows paths such as /mybucket/.././file, where ".." and "." must not be removed during path normalization because they are part of the name of the Amazon S3 object. In such a case you will need to uncheck the Normalize paths option in the Amazon S3 Connector node configuration dialog, as the sketch below illustrates.
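
A minimal Python sketch of what standard normalization would do to such a path (the object name is hypothetical):

    import posixpath

    s3_key = "/mybucket/.././file"
    # Normalization resolves the ".." and "." segments, which here would
    # change the object name and point to a different S3 object.
    print(posixpath.normpath(s3_key))  # /file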

Please be aware that each bucket in Amazon S3 belongs to an AWS region, e.g. eu-west-1. To access the bucket the client needs to be connected to the same region. You can select the region to connect to in the Amazon Authentication node configuration dialog.

[Image: Amazon Authentication node configuration dialog]

Google file systems

We support two file systems related to Google. Even though both belong to Google services, the corresponding Connector nodes use different authentication types, and hence different Authentication nodes.

To connect to Google Cloud Storage you will need to use:

  • Google Cloud Storage Connector node

  • Google Authentication (API Key) node

[Image: Google Authentication (API Key) and Google Cloud Storage Connector nodes]

Like the Amazon S3 Connector node, the Google Cloud Storage Connector node also normalizes paths.

To use the Google Authentication (API Key) node you will need to create a project at console.developers.google.com.

The specific Google API you want to use has to be enabled under APIs.

After you create your Service Account you will receive a p12 key file, which you will need to point to in the Google Authentication (API Key) node configuration dialog.

To connect to Google Drive, instead, you will need to use:

  • Google Drive Connector node

  • Google Authentication node

[Image: Google Authentication and Google Drive Connector nodes]

The root folder of the Google Drive file system contains your Shared drives, if any are available, and the folder My Drive. The path of a shared drive will then be /shared_drive1/, while the path of your My Drive folder will be /My Drive/.

Microsoft file systems

We support two file systems related to Microsoft.

To connect to SharePoint Online or to Azure Blob Storage you will need to use:

  • SharePoint Online Connector node or Azure Blob Storage Connector node

  • Microsoft Authentication node

The SharePoint Online Connector node connects to a SharePoint Online site. Here, document libraries are represented as top-level folders.

[Image: Microsoft Authentication and SharePoint Online Connector nodes]

In the node configuration dialog you can choose to connect to the following sites:

  • Root Site: Root site of your organization

  • Web URL: https URL of the SharePoint site (same as in the browser)

  • Group site: Group site of an Office 365 user group

  • Subsite: Connects to a subsite or a sub-subsite of the above

The Azure Blob Storage Connector node connects to an Azure Blob Storage file system.

The path syntax is UNIX-like, i.e. /mycontainer/myfolder/myfile, and relative to the root of the storage. The Azure Blob Storage Connector node also performs path normalization.

[Image: Microsoft Authentication and Azure Blob Storage Connector nodes]

The Microsoft Authentication node offers OAuth authentication for the Azure and Office 365 clouds.

It supports the following authentication modes:

  • Interactive authentication: Performs an interactive, web browser based login by clicking on Login in the node dialog. In the browser window that pops up, you may be asked to consent to the requested level of access. The login results in a token being stored in a configurable location. The token will be valid for a certain amount of time that is defined by your Azure AD settings.

  • Username/password authentication: Performs a non-interactive login to obtain a fresh token every time the node executes. Since this login is non-interactive and a fresh token is obtained each time, this mode is well suited for workflows on KNIME Server (see the sketch after this list). However, it also has some limitations. First, you cannot consent to the requested level of access, hence consent must be given beforehand, e.g. during a previous interactive login, or by an Azure AD directory admin. Second, accounts that require multi-factor authentication (MFA) will not work.

  • Shared key authentication (Azure Storage only): Specific to Azure Blob Storage. Performs authentication using an Azure storage account and its secret key.

  • Shared access signature (SAS) authentication (Azure Storage only): Specific to Azure Blob Storage. Performs authentication using a shared access signature (SAS). For more details on shared access signatures, see the Azure storage documentation.
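
For illustration only, a minimal sketch of a non-interactive username/password login using Microsoft's msal Python library; the client ID, account, and scope are placeholders, and the KNIME node performs the equivalent steps internally:

    import msal

    app = msal.PublicClientApplication(
        "00000000-0000-0000-0000-000000000000",  # placeholder client ID
        authority="https://login.microsoftonline.com/organizations",
    )
    # A fresh token is requested on every call, which is why this mode suits
    # unattended execution but fails for accounts that require MFA.
    result = app.acquire_token_by_username_password(
        "user@example.com", "password", scopes=["Files.Read"]  # placeholders
    )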

[Image: Microsoft Authentication node configuration dialog]

File systems without external Authentication

All the Connector nodes that do not need an external Authentication node connect to a specific file system upon execution. This allows downstream nodes to access the files of the remote server or file system.

KNIME Server Connector node

With the KNIME Server Connector node you are able to connect to a KNIME Server instance.

When opening the KNIME Server Connector node configuration dialog you can either type in the URL of the KNIME Server you want to connect to, or Select… it from those available among your mountpoints in KNIME Explorer. You do not need to have the KNIME Server mountpoint set up or already connected to use this node.

[Image: KNIME Server Connector node configuration dialog]

You can authenticate either by typing in your username and password or by using credentials provided by a flow variable, if any is available. Please be aware that when you authenticate by typing in your username and password, the password is persistently stored in encrypted form in the node settings and is therefore saved with the workflow.

Read and write from or to a connected file system

Once you have successfully connected to a connected file system, you can connect the output port of the Connector node to any node developed under the File Handling framework.

To do this you will need to activate the corresponding dynamic port of your node.

You can also enable dynamic ports to connect to a connected file system in Utility nodes.

To add a dynamic port to one of the nodes where this option is available, right-click the three dots in the bottom-left corner of the node and choose Add File System Connection port from the context menu.

[Image: adding a File System Connection port via the context menu]

Reader nodes

A number of Reader nodes in KNIME Analytics Platform have been updated to work within the File Handling framework.

You can use these Reader nodes with both Standard file systems and Connected file systems. Moreover, using the File System Connection port you can easily switch the connection between different connected file systems.

In the following example, a CSV Reader node with a File System Connection port is connected to an Azure Blob Storage Connector node and reads a .csv file from the Azure Blob Storage connected file system. Exchanging the connection with any of the other Connector nodes, e.g. Google Drive or SharePoint Online, allows reading a .csv file from those file systems.

[Image: CSV Reader node with a File System Connection port connected to an Azure Blob Storage Connector node]

Transformation tab

When reading a file into KNIME Analytics Platform you can also perform some transformations to prepare your data table.

In the updated reader node's configuration dialog, after selecting the desired file, you can go to the Transformation tab.

[Image: Transformation tab of a reader node configuration dialog]

This tab displays every column as a row in a table that allows modifying the structure of the output table. It supports reordering, filtering and renaming columns. It is also possible to change the type of the columns. Reordering is done via drag and drop. Just drag a column to the position it should have in the output table. Whether and where to add unknown columns during execution is specified via the special row <any unknown new column>. Note that the positions of columns are reset in the dialog if a new file or folder is selected.

If you are reading multiple files from a folder, i.e. with the Files in folder option in the Settings tab, you can also choose whether the columns of the resulting table are the union or the intersection of the columns of the files being read.
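
To illustrate the difference between the two modes outside of KNIME, a minimal pandas sketch (the file names are hypothetical):

    import pandas as pd
    from functools import reduce

    frames = [pd.read_csv(name) for name in ["a.csv", "b.csv"]]

    # Union: keep every column that appears in any file;
    # columns missing from a file are filled with NaN.
    union = pd.concat(frames, ignore_index=True)

    # Intersection: keep only the columns present in all files.
    common = reduce(set.intersection, (set(f.columns) for f in frames))
    intersection = pd.concat([f[sorted(common)] for f in frames], ignore_index=True)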

Writer nodes

Writer nodes can also be used in KNIME Analytics Platform to work within the File Handling framework.

You can use these Writer nodes with both Standard file systems and Connected file systems. Moreover, using the File System Connection port you can easily switch the connection between different connected file systems.

A File System Connection port can be added to Writer nodes, allowing them to be easily connected to different file systems and to write files to them persistently.

In the following example, a CSV Writer node with a File System Connection port is connected to an Azure Blob Storage Connector node and writes a .csv file to the Azure Blob Storage connected file system. A CSV Reader node reads a .csv file from a SharePoint Online file system, the data is transformed, and the result is written to the Azure Blob Storage file system.

[Image: example workflow reading from SharePoint Online and writing to Azure Blob Storage]

Path data cell and flow variable

Files and folders can be uniquely identified via their path within a file system. Within KNIME Analytics Platform such a path is represented by the path type. A path type consists of three parts:

  1. Type: Specifies the file system type, e.g. local, relative, mountpoint, custom_url or connected.

  2. Specifier: An optional string that contains additional file-system-specific information, such as the location the relative to file system works with, e.g. workflow or mountpoint.

  3. Path: Specifies the location within the file system using file-system-specific notation, e.g. C:\file.csv on Windows or /user/home/file.csv on Linux.

Path examples are:

  • LOCAL

    • (LOCAL, , C:\Users\username\Desktop)

    • (LOCAL, , \\fileserver\file1.csv)

    • (LOCAL, , /home/user)

  • RELATIVE

    • (RELATIVE, knime.workflow, file1.csv)

    • (RELATIVE, knime.mountpoint, file1.csv)

  • MOUNTPOINT

    • (MOUNTPOINT, MOUNTPOINT_NAME, /path/to/file1.csv)

  • CUSTOM_URL

    • (CUSTOM_URL, , https://server:443/my%20example?query=value#frag)

    • (CUSTOM_URL, , knime://knime.workflow/file%201.csv)

  • CONNECTED

    • (CONNECTED, amazon-s3:eu-west-1, /mybucket/file1.csv)

    • (CONNECTED, microsoft-sharepoint, /myfolder/file1.csv)

    • (CONNECTED, ftp:server:port, /home/user/file1.csv)

    • (CONNECTED, ssh:server:port, /home/user/mybucket/file1.csv)

    • (CONNECTED, http:server:port, /file.asp?key=value)

A path type can be packaged into either a Path Data Cell or a Path Flow Variable. By default, the Path Data Cell within a KNIME data table only displays the path part. If you want to display the full path you can change the cell renderer to the Extended path renderer via the context menu of the table header.
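
A minimal sketch, assuming nothing about KNIME's internal classes (the class and field names are hypothetical), that models the three parts of the path type and reproduces the notation used in the examples above:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PathType:          # hypothetical name, for illustration only
        fs_type: str         # e.g. "LOCAL", "RELATIVE", "MOUNTPOINT", "CONNECTED"
        specifier: str       # optional, e.g. "knime.workflow" or "amazon-s3:eu-west-1"
        path: str            # file-system-specific notation, e.g. "/mybucket/file1.csv"

    p = PathType("CONNECTED", "amazon-s3:eu-west-1", "/mybucket/file1.csv")
    print(f"({p.fs_type}, {p.specifier}, {p.path})")
    # (CONNECTED, amazon-s3:eu-west-1, /mybucket/file1.csv)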

Creating path data cells

In order to work with files and folders in KNIME Analytics Platform you can either select them manually via the node configuration dialog, or you can list the paths of specific files and/or folders. To do this you can use the List Files/Folders node: simply open the dialog and point it to the folder whose content you want to list. The node provides the following options (a sketch illustrating them follows the list):

  • Files in folder: Will return a list of all files within the selected folder that match the Filter options.

  • Folders: Will return all folders that have the selected folder as parent. To include all subfolders you have to select the Include subfolders option.

  • Files and folders: Is a combination of the previous two options and will return all files and folders within the selected folder.
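
As a rough illustration of these three options, a pathlib sketch on a hypothetical local folder:

    from pathlib import Path

    folder = Path("/data")  # hypothetical folder to list

    files_in_folder = [p for p in folder.iterdir() if p.is_file()]  # "Files in folder"
    folders = [p for p in folder.iterdir() if p.is_dir()]           # "Folders"
    files_and_folders = list(folder.iterdir())                      # "Files and folders"

    # "Include subfolders" corresponds to a recursive listing:
    all_files_recursive = [p for p in folder.rglob("*") if p.is_file()]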

[Image: List Files/Folders node configuration dialog]

Creating path flow variables

There are two ways to create a path flow variable. The first is to export it via the dialog of the node where you specify the path, e.g. a CSV Writer node, where you may want to export the path of the written file in order to consume it in a subsequent node. The second is to convert a path data cell into a flow variable using one of the available variable nodes, such as the Table Row to Variable or Table Row to Variable Loop Start node.

String and path type conversion

Not all nodes that work with files have been converted to the new file handling framework yet, and thus not all of them support the path type. These nodes require either a String or URI data cell, or a string flow variable.

From path to string

The Path to String node converts a path data cell to a string data cell. By default, the node creates a string representation of the path that can be used in a subsequent node that still requires the old string or URI type, e.g. the JSON Reader node.
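
For illustration, a hedged sketch of the kind of string this conversion produces for a (RELATIVE, knime.workflow, …) path; the helper function is hypothetical and not part of any KNIME API:

    from urllib.parse import quote

    def to_knime_url(specifier: str, path: str) -> str:
        # Hypothetical helper that renders a relative path as a KNIME URL,
        # percent-encoding characters such as spaces.
        return f"knime://{specifier}/" + quote(path.lstrip("/"))

    print(to_knime_url("knime.workflow", "data/file 1.csv"))
    # knime://knime.workflow/data/file%201.csv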

You can download this example workflow from KNIME Hub.
[Image: Path to String example workflow]

If you only want to extract the plain path, you can disable the Create KNIME URL for 'Relative to' and 'Mountpoint' file system option in the Path to String node configuration dialog.

[Image: Path to String node configuration dialog]

Similar to the Path to String node, the Path to String (Variable) node converts the selected path flow variables to a string variable.

If you want to use a node that requires a URI cell you can use the String to URI node after the Path to String node.

From string to path

In order to convert a string path to the path type you can use the String to Path node. The node has a dynamic File System port that you need to connect to the corresponding file system if you want to create a path for a connected file system such as Amazon S3.

Similar to the String to Path node, the String to Path (Variable) node converts a string flow variable into a path flow variable.

File Folder Utility nodes

With the introduction of the new file handling framework in KNIME Analytics Platform release 4.3, we also updated the functionality of the utility nodes.

You can find the File Folder Utility nodes in the node repository under the category IO > File Folder Utility.

[Image: File Folder Utility category in the node repository]

You can add a dynamic port to connect to a connected file system directly with the File Folder Utility nodes. In this way you can easily work with files and folders in any file system that is available.

In the example below, the Transfer Files node is connected to two file systems, a source file system (Google Drive in this example) and a destination file system (SharePoint Online), to easily transfer files from Google Drive to SharePoint Online.

[Image: Transfer Files node connected to Google Drive and SharePoint Online Connector nodes]

Since some of the node names have changed as the nodes gained enhanced functionality, Table 1 provides a conversion table.

Table 1. File Folder Utility nodes conversion table
  Node name               Deprecated node name
  ---------------------   ---------------------------------
  Compress Files/Folder   Zip Files
  Decompress Files        Unzip Files
  Create Folder           Create Directory
  Delete Files/Folders    Delete Files
  Transfer Files          Copy/Move Files, Download, Upload
  List Files/Folders      List Remote Files, List Files
  Create Temp Folder      Create Temp Dir