

What AWS Database to use for IoT Data?

Published on 16 Jul, 2024 by Jonathan

Green Custard and AWS recently hosted a round-table breakfast meeting for six leading climate-tech start-ups. The discussion was held under the Chatham House Rule, so we can’t say who said what, but the conversation we had about IoT and data storage on AWS is worth sharing.

A group of business owners and IoT Experts gather for a round table event with AWS

To summarise the back and forth in the room: there was a lot of discussion about whether to use Amazon Timestream, about patterns for analytics including the use of Amazon Kinesis, Apache Kafka and Apache Flink, and several comments on the fact that most attendees store their data in Amazon S3, in varying formats.

From the discussion, we can distil two main rules:

  1. S3 is cheap. There is no problem with storing all of your IoT device data in S3, and if you are worried about long-term costs, S3 Lifecycle policies can be used to manage them over time. The recommendation from the room was to store the data in Apache Iceberg format. Data in S3 is also easy to query using Amazon Athena, and easy to build Business Intelligence (BI) and analytics on top of using Amazon QuickSight. Having all of the raw data in S3 also allows replay or Extract-Transform-Load (ETL) into new databases at any point (see the sketch after this list).
  2. Understand your use cases. AWS offers a myriad of database and analytics services, and choosing the right ones depends on understanding your use cases; we’ll explore this a little more below. Importantly, don’t pick a service just because your engineers are already familiar with it.
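As a concrete illustration of the first rule, here is a minimal sketch using boto3. The bucket, database and table names (and the Iceberg table itself) are hypothetical, so treat it as a starting point rather than a definitive implementation: a Lifecycle configuration keeps long-term storage costs down, and an Athena query runs directly over the data lake.

```python
"""
A minimal sketch of the 'store everything in S3' rule: tier old data down to
cheaper storage classes, and query the lake with Athena. All names are
hypothetical placeholders.
"""
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Tier older raw telemetry down to cheaper storage classes over time.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-iot-raw-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-telemetry",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # after ~3 months
                    {"Days": 365, "StorageClass": "GLACIER"},     # after a year
                ],
            }
        ]
    },
)

# Query the lake directly with Athena; 'iot_telemetry' is assumed to be an
# Apache Iceberg table registered in the Glue data catalogue.
athena.start_query_execution(
    QueryString=(
        "SELECT device_id, avg(temperature) AS avg_temp "
        "FROM iot_telemetry "
        "WHERE reading_date = DATE '2024-07-01' "
        "GROUP BY device_id"
    ),
    QueryExecutionContext={"Database": "iot_lake"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-iot-athena-results/"},
)
```

The same Athena tables can then be used as a data source for Amazon QuickSight dashboards, covering the BI and analytics cases without moving the data anywhere else.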

To understand the use cases, you need a good understanding of your stakeholders/users and the user journeys relevant to them. For example, for IoT Connected Products, the users and use cases could cover:

  • End users, wanting to see live data or graphs of historic data, responsively, via a mobile application.
  • Support engineers, wanting to drill down into specific data depending on the problem they are investigating.
  • Operators, looking at overall fleet performance and business metrics.

For our customers with Connected Products, there are often 10 or so stakeholders or user types, so it is important to consider this carefully. There will also be multiple user stories or use cases per user type.

From the use cases and the data being ingested, you can work out the types of operations each use case will need. You should be able to get an idea of:

  • For write operations, what is the typical data and at what frequency?
  • What are the query or read operations, the data they operate over, their frequency, and the responsiveness needed for the user?
  • Are queries operating against aggregates or raw data, and can aggregates be pre-computed?

You may identify that several use cases share common access patterns.
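To make that concrete, the sketch below uses entirely hypothetical use cases, figures and field names to show one way of cataloguing the write and query patterns so that shared requirements stand out before any service is chosen.

```python
"""
A hypothetical catalogue of access patterns. The figures are illustrative
only; the point is to make write and query characteristics explicit before
choosing a database.
"""
from dataclasses import dataclass

@dataclass
class AccessPattern:
    use_case: str        # which user journey this serves
    operation: str       # "write" or "read"
    data: str            # what data it touches
    frequency: str       # how often it happens
    latency_target: str  # how responsive it must be for the user

PATTERNS = [
    AccessPattern("device telemetry ingest", "write",
                  "one JSON reading per device", "every 30 s per device", "n/a"),
    AccessPattern("end-user live graph", "read",
                  "last 24 h of readings for one device", "on app open", "< 1 s"),
    AccessPattern("support engineer drill-down", "read",
                  "raw readings for one device over weeks", "ad hoc", "seconds"),
    AccessPattern("operator fleet dashboard", "read",
                  "daily aggregates across the whole fleet", "hourly refresh", "seconds"),
]

# Grouping by data shape and latency quickly shows which use cases can share a
# store (the two raw-reading reads, for example) and which need pre-computed
# aggregates.
for p in PATTERNS:
    print(f"{p.use_case:32} {p.operation:5} {p.frequency:24} {p.latency_target}")
```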

From the write and query patterns, you can select one or more tools from the AWS database toolkit.

If used appropriately and with the proper constraints (e.g. Time To Live, TTL), storage on AWS is generally cheap, so duplicating data in different formats and database types can be very beneficial. Using S3 as long-term mass storage, with another database for fast access to ‘hotter’ data, gives the benefits of both storage types. For example, running a daily ETL process from the raw S3 data into aggregated blocks in Amazon DynamoDB controls the data/read/write costs of DynamoDB while giving fast access to the data. In another example, incoming data can be stored in S3 for the long term and simultaneously written to Timestream with a short TTL, giving live dashboards over recent data without incurring large storage costs in Timestream.
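As a rough sketch of the second example (the names, the measure and the idea of a single handler are assumptions, not a prescribed implementation), a message handler might write each reading to S3 for the long term and to a Timestream table whose short retention settings play the role of the TTL:

```python
"""
A minimal sketch of the dual-write pattern: every incoming IoT message goes to
S3 for cheap long-term storage and to Amazon Timestream for live dashboards.
The Timestream table is assumed to be configured with a short retention period
(its 'TTL'), so hot storage costs stay bounded. All names are placeholders.
"""
import json
import time

import boto3

s3 = boto3.client("s3")
tsw = boto3.client("timestream-write")

RAW_BUCKET = "my-iot-raw-data"  # hypothetical bucket
TS_DATABASE = "iot_live"        # hypothetical Timestream database
TS_TABLE = "telemetry"          # hypothetical table with short retention

def handle_message(device_id: str, payload: dict) -> None:
    """Persist one telemetry message to both stores."""
    now_ms = int(time.time() * 1000)

    # 1. Long-term, cheap storage in S3, partitioned by device and day.
    key = f"raw/{device_id}/{time.strftime('%Y/%m/%d')}/{now_ms}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(payload))

    # 2. Hot storage in Timestream for recent-data dashboards; the table's
    #    retention period keeps its storage costs small.
    tsw.write_records(
        DatabaseName=TS_DATABASE,
        TableName=TS_TABLE,
        Records=[
            {
                "Dimensions": [{"Name": "device_id", "Value": device_id}],
                "MeasureName": "temperature",
                "MeasureValue": str(payload["temperature"]),
                "MeasureValueType": "DOUBLE",
                "Time": str(now_ms),
                "TimeUnit": "MILLISECONDS",
            }
        ],
    )
```

The daily DynamoDB aggregation in the first example follows the same principle: the raw S3 data stays authoritative, and the second store only ever holds the slice that needs to be fast.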

In summary: always store your data in S3 because it is cheap; really understand your use cases and pick the right tools for the job rather than relying on the tools you already know; and finally, remember that multiple data stores for differing purposes are the norm.
