Hey everyone,
Here's part two of our How To double header. If you're reading this and you haven't yet checked out the [How To] Load data from a file server to the Data Lake, I highly recommend you check that out first.
Metadata Magic
Right, this is a pretty fun How To to run through. If you've been on calls with me, or attended any of my Inforum or TechEd sessions, then you've invariably heard me say, "data lakes aren't data warehouses." There's a lot to pick apart there, and there are some excellent articles online on why and how data lakes can complement your enterprise data strategy.
So, what is a data lake then? It's a temporal data store. The Infor Data Lake uses object storage - a fancy way of saying file storage or file server - to receive and store data exactly as-is, as the application, user, or API intended. CloudSuite applications accumulate changes that take place in the ERP and replicate them as files. Specifically, ndjson files. Though, you're welcome to use alternative formats such as CSV.
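If you haven't worked with ndjson (newline-delimited JSON) before, it's simply one complete JSON object per line of the file - no enclosing array, no commas between records. A hypothetical two-record products file might look like this (the field names here are illustrative; your actual property names come from the application's metadata definition):

```json
{"ProductID": 1, "ProductName": "Chai", "UnitPrice": 18.00, "UnitsInStock": 39}
{"ProductID": 2, "ProductName": "Chang", "UnitPrice": 19.00, "UnitsInStock": 17}
```

Because each line stands alone, files like this can be split, appended to, and processed record-by-record, which suits the Data Lake's file-based replication model nicely.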
When you query the Data Lake, you may very well be retrieving rows of data that are spread across hundreds or thousands of files. How the data is stored doesn't impact the Data Lake's ability to process and retrieve it. But it does make for some interesting file system metadata use cases we'll cover in a future How To.
To get to the point here, the Data Lake captures not just the current values of a record within the ERP, but also all the previous versions of what that record looked like. Applications replicate all of their state changes to the Data Lake so that you can start using the Compass tools to scrub across the timeline of a record. This is aided by some metadata wizardry and something we call indicators.
There are three primary indicators today:
- Identifier ID: the 'primary key' or 'document id' of a row of data
- Variation ID: the 'version' of that particular row
- Delete ID: an indicator that signals whether the row reflects a "delete" transaction from the system of record

These three indicators introduce some interesting use cases for exploring and analyzing your data over long periods of time.
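To make the Delete ID indicator a little more concrete, here's a hypothetical sketch of a query that surfaces rows flagged as deleted in the system of record. The DeleteFlag column name is invented for this example - the actual property your Delete ID indicator maps to depends on your metadata definition:

```sql
-- Hypothetical sketch: find rows reflecting a "delete" transaction.
-- "DeleteFlag" stands in for whatever property your Delete ID indicator maps to.
SELECT *
FROM Products
WHERE DeleteFlag = 'true';
```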
Sharing is Caring
This time around, I'll make all of the content available up-front to move through this activity. That should make it a bit easier to track down the files you'll need if you'd like to play along at home.
* Editor's Note: Ignore the "syntax error" the forum's JSON parser reports in the file previews below. Hovering over each preview lets you expand the file to Full Screen (bottom right-hand corner) or Download it (upper right-hand corner). I recommend downloading the files rather than copying & pasting things - messy business, sometimes.
Products metadata package
[View:/cfs-file/__key/communityserver-discussions-components-files/96/1145.Products.zip:320:240]
products_1.json
[View:/cfs-file/__key/communityserver-discussions-components-files/96/2402.products_1.json:320:240]
products_2.json
[View:/cfs-file/__key/communityserver-discussions-components-files/96/products_2.json:320:240]
products_3.json
[View:/cfs-file/__key/communityserver-discussions-components-files/96/products_3.json:320:240]
File Ingestion 201
Let's pick up where we left off in our first file server How To and import two new products files. Above, hover over the products_2.json and products_3.json files and download copies that you can upload to your file server.
Before we dive in further, take a deeper look at productid 1 - our Chai product - in each file. You should notice that, over time, the data is changing. You'll see that the Unit Price has been updated from $18 to $19.99 before trending downwards and landing at $19.50. Units in Stock also reflects changes over time. And lastly, our variation keeps incrementing by one.
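Laid side by side, the three versions of Chai across the three files look roughly like this (field names are illustrative, and I've trimmed the records down to the properties discussed above):

```json
{"ProductID": 1, "ProductName": "Chai", "UnitPrice": 18.00, "Variation": 1}
{"ProductID": 1, "ProductName": "Chai", "UnitPrice": 19.99, "Variation": 2}
{"ProductID": 1, "ProductName": "Chai", "UnitPrice": 19.50, "Variation": 3}
```

Same Identifier ID, three Variation IDs - exactly the shape of history the Data Lake is built to retain.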
If you haven't previously de-activated your Data Lake Flow, then uploading your files should result in the products files arriving in the Data Lake quite quickly. If you haven't seen any movement or changes in OneView, however, it may be worth a quick check to confirm that your Data Lake Flow is active.

Selecting current, operational data
Once you've uploaded these two additional files, navigate within ION Desk to the Data Lake menu and drill down to the Compass Query Editor - familiar territory by now. Do a quick filter in the catalog for the Products table and generate a simple SQL SELECT against it. Before running the query, add a filter clause to retrieve only Product ID 1.
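If you'd rather type it by hand than use the attached file, the query boils down to something like this (the exact table and column names come from your own catalog and metadata definition):

```sql
-- Retrieve the current, de-duplicated row for Product ID 1.
SELECT *
FROM Products
WHERE ProductID = 1;
```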
[View:/cfs-file/__key/communityserver-discussions-components-files/96/CurrentVariation.sql:320:240]
When we run this query, we retrieve a single row of data for our Chai product ID. This is in spite of the fact that a version of Chai exists in each of the products_*.json files we ingested into the Data Lake.
The key to how the result set is de-duplicated lies in the Products metadata definition: we've defined a metadata indicator that identifies the variation property as the version of the row that was replicated. Within each file, the variation number for each version of Chai increments by 1, and as you scroll to the right of the result set, you'll find that our de-duplicated result reflects a variation number of 3.
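Conceptually, you can think of this default behavior as "keep only the highest variation per identifier." Compass handles this for you via the metadata indicators, but in standard SQL terms it's roughly equivalent to the following sketch (column names are illustrative):

```sql
-- Conceptual equivalent of the Data Lake's default de-duplication:
-- keep only the row with the highest variation per identifier.
SELECT *
FROM Products p
WHERE p.Variation = (
    SELECT MAX(p2.Variation)
    FROM Products p2
    WHERE p2.ProductID = p.ProductID
);
```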

Querying data history
Well, we've seen the Data Lake behave as you'd expect a typical database to. But with some clever SQL hints, we can start exploring data across time and crafting queries that search for not just what the data looks like today, but also what it looked like over time.
With the following SQL hint, you can instruct the Data Lake query processor to retrieve all variations of a particular row - as many as exist, of course. Within your query, simply insert the following line:
--*includeAllVariations={table}
Substitute your own table name into the hint when querying your own content or, if you've been playing along, use the Products table.
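Put together for the Products table, the full query looks something like this sketch - the hint goes in as a comment line within the query text, and the ORDER BY is just an optional convenience for reading the timeline top to bottom:

```sql
--*includeAllVariations=Products
SELECT *
FROM Products
WHERE ProductID = 1
ORDER BY Variation;
```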
[View:/cfs-file/__key/communityserver-discussions-components-files/96/AllVariations.sql:320:240]
Alternatively, you can generate this query automatically. Simply find or filter for a table in the catalog panel and hover over the table to expose the . . . icon. Click it and navigate down to the SQL family of content generation, and this time, rather than picking the SELECT option, choose the SELECT (All Variations) option. Compass will generate your query for you, SQL hint included.
Then, just Run Query (Windows: CTRL + Enter; Apple: ⌘ + return) to retrieve your result set.
Returned back is every instance of Chai.

What's Next?
By this time, hopefully it's clear why I keep repeating my "data lakes aren't data warehouses" comment.
Something ordinarily built to spec with a historical data warehouse or data mart is easily accomplished within the Data Lake. The ability to scrub over tuple-based timelines, especially for AI/ML use cases, presents some tremendous analytical and pattern-detection value.
This is certainly not all you can do with SQL hints. Using the Data Lake's object store metadata to aid in data processing also opens up some really interesting data exploration and refinement opportunities for your ETL and processing jobs.
Hope you've gotten a kick from this How To!
@mike