Difference between revisions of "EclipseSCADA/Plan/Haystack"

Latest revision as of 12:24, 27 August 2014

This is the planning document for Eclipse SCADA Haystack, a big data value archive.

This document will change during the planning and evaluation phase. We are happy to receive your input and thoughts.

Background

The Eclipse SCADA system can store analog values in a value data archive. As a storage is used a plain and simple file system structure named HDS. With this it is possible to store analog values in raw format (including sub-second timestamp resolution) for several years. It provides quick access for standard chart queries and only has a modest storage use (1 item, 1 month, one value every 250ms -> ~120MB). Due to its simple file structure, the HDS archive data can easily be replicated using simple rsync jobs (or other remove file copy tools), to some sort of slave instance which aggregates the data of multiple other value archives. This includes buffering on the local node as long as the connection to the global node is not available.

However the system will not scale beyond one node for an aggregated archive.

Purpose

The purpose of Eclipse SCADA Haystack is to provide a value data archive for analog values which can scale with the numbers of nodes added. The archive should be an aggregated archive, which gathers data from a distributed set of nodes and stores it persistently in a cluster.

Requirements

Technical

Scale with the size of the cluster
Provide data redundancy inside the cluster

Figures

The archive should be able to handle 10.000.000 items with an insertion rate of 1.000.000 insertions per second.

Configuration

The collector nodes and items should be configured on the server, not on the collector node. The collector node only contacts the cluster and fetches the newest configuration, providing the required values.

Data

Store analog floating point values (Java "double")
The naming of the data stores (items) should not be restricted
Store values with sub-second resolution
Provide a way to detect stale value sources (timeout)
Insertion operations should be idempotent

Interface

Provide a simple HTTP/HTTPS based interface for provided data
Check out other systems and provide adapters for importing their data
- collectd?
- others?
Provide an OpenTSDB collector interface

Limitations

Append only

The archive is designed to receive a stream of values in chronological order. This means that for one value source all items must be provided by the collector node with ascending timestamps. Unless state otherwise, data cannot be inserted at a random point in time.

Architecture & Concepts

The idea at the moment is to use Hadoop and HBase for creating the storage cluster.

The collector nodes push the data to import nodes using HTTP. Collector nodes can be located anywhere. Import nodes must be accessible for the collector nodes using HTTP, but can be behind some load balancer. Each import node should be able to import data for any collector node.

Once the import node has completed importing the data, it returns the result with the HTTP reply.

Query functionality will process query requests aside from the import nodes by directly accessing the HBase cluster. At the moment there is no plan for specific server applications beside the HTTP daemon on the import node. This reduces the need for a specific query node <-> client protocol. Eclipse SCADA has already such a protocol for its use case. However this is not a limitation, a specific query server with remoting can be added when needed.

Row key schema

Internal Storage

The data of all sources is stored in the same table
Each value is inserted independently at first
Then a second table will store compacted data frames
Compaction is triggered periodically

Compaction

Data is compacted by a time range of 2^16 ms -> about 65seconds
The compaction has a "high watermark" timestamp which indicates up to with time frame the data is compacted
Queries need to load from the compacted table first, when the "high watermark" is reached the remaining data must be read from the raw table
Compaction writes the data first and then the "high watermark", including the latest value
Queries read the "hight watermark" first, then read the data
Uncompacted data needs to the kept for a specific amount of time, for queries that still need to read raw, uncompacted data.

Glossary

node: A single computer, physical or virtual
collector, collector node: A node or service collecting data
item, source: A source that provides data exactly for one measure point

@@ Line 2: / Line 2: @@
 This document will change during the planning and evaluation phase. We are happy to receive your input and thoughts.
+[https://docs.google.com/document/d/1ychvdPHBxfd_9_nfxB6a1LYhIhc1kmb8GcQnLIO48so/edit#heading=h.1zq7g2v496tc Draf proposal on Google Drive]
 == Background ==

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Difference between revisions of "EclipseSCADA/Plan/Haystack"

Latest revision as of 12:24, 27 August 2014

Contents

Background

Purpose

Requirements

Technical

Figures

Configuration

Data

Interface

Limitations

Append only

Architecture & Concepts

Row key schema

Internal Storage

Compaction

See also

OpenTSDB

Kairosdb

Glossary

Breadcrumbs

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Difference between revisions of "EclipseSCADA/Plan/Haystack"

Latest revision as of 12:24, 27 August 2014

Contents

Background

Purpose

Requirements

Technical

Figures

Configuration

Data

Interface

Limitations

Append only

Architecture & Concepts

Row key schema

Internal Storage

Compaction

See also

OpenTSDB

Kairosdb

Glossary