Hi,
I have an object in the data lake with more than 10K records. I am trying to use the splitquery Data Fabric API, but without results. I would appreciate a hint on how to extract those 10K records using the Data Fabric API.
Best,
Thomas
Here is an example I just tested that returned 4 queries.
/DATAFABRIC/datalake/v2/dataobjects/splitquery?filter=dl_document_name%20eq%20'LN_tcmcs098'&records=200
There are some considerations around the ratio of objects in the filter. How many files are in the data lake for the object you are requesting? How are you filtering, and how many records are you requesting in each filter? What response are you getting when you state "without results"? Does the filter work in a GET /dlobjects request?

Result:
[ { "queryFilter": "(document_name eq 'LN_tcmcs098' AND dl_id range {null,"1-52c03021-952a-3b27-837b-c65bccaf99b4"})", "sortFields": [ "dl_id:asc" ] }, { "queryFilter": "(document_name eq 'LN_tcmcs098' AND dl_id range ["1-52c03021-952a-3b27-837b-c65bccaf99b4","1-a3626013-0b52-3b4a-abcb-876964a34fd9"})", "sortFields": [ "dl_id:asc" ] }, { "queryFilter": "(document_name eq 'LN_tcmcs098' AND dl_id range ["1-a3626013-0b52-3b4a-abcb-876964a34fd9","1-f2356bac-2155-39e6-b42c-763806df0b0c"})", "sortFields": [ "dl_id:asc" ] }, { "queryFilter": "(document_name eq 'LN_tcmcs098' AND dl_id range ["1-f2356bac-2155-39e6-b42c-763806df0b0c",null})", "sortFields": [ "dl_id:asc" ] } ]
Thank you!
I have approximately 11K files of that object in the data lake (and expect significantly more in the future). My exact problem is with setting the number of records for the filter: when I went below 2000, I received an "Internal Server Error". It is not clear to me how to calculate a safe "records" number, so that it is not too big but still stays on the safe side and does not trigger the error.
OK, so the interface is working; it just has issues with the right calculation against the data. Note that on the filter you can further limit the data, commonly using a day range, so that you have less in scope.
We would expect that you are only pulling the data once and not pulling all the data every day. Initial loading/resets would be the exception to needing a larger quantity.
Consider using a time increment: set a starting value, pull the next increment, and keep pulling until you are current; then pull each increment going forward.
You could also consider using the 'extract' from the data lake in a flow to send you the files as they become available on a time interval. This essentially does all of this for you, just as individual files.
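As a rough illustration of the time-increment approach above, here is a sketch that walks a watermark forward one day at a time. The timestamp property (dl_event_date), the filter operators, and the endpoint details are assumptions; substitute whatever property and syntax your objects actually support.

```python
# Minimal sketch of the incremental pull described above.
# Property name, operators, and endpoint are assumptions.
from datetime import datetime, timedelta, timezone

import requests

BASE = "https://<ion-api-gateway>/<TENANT>/DATAFABRIC/datalake/v2"  # assumption
HEADERS = {"Authorization": "Bearer <access_token>"}                # assumption

def pull_window(start, end):
    """Pull one increment's worth of LN_tcmcs098 records (filter syntax assumed)."""
    window_filter = (
        "dl_document_name eq 'LN_tcmcs098' "
        f"AND dl_event_date ge '{start.isoformat()}' "
        f"AND dl_event_date lt '{end.isoformat()}'"
    )
    resp = requests.get(f"{BASE}/dlobjects",
                        headers=HEADERS,
                        params={"filter": window_filter})
    resp.raise_for_status()
    return resp.json()

# Start from a stored watermark, walk forward one day at a time until "now";
# afterwards, run the same pull on a schedule, one increment at a time.
watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumption: persisted elsewhere
step = timedelta(days=1)
while watermark < datetime.now(timezone.utc):
    data = pull_window(watermark, watermark + step)
    # ... process/store data ...
    watermark += step
```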
Hi Kevin,
Thank you for the very valuable answer. I really appreciate it.
My understanding is that the data lake API returns the most recent variation of the object (?). This is not clearly described in the documentation, but at least based on my first test I assume this is the case. If so, adopting the incremental approach would mean that I have to find and update the corresponding record on the other side. Am I correct?
Could you please also advise why I got an "Internal Server Error" when I set the records number below 2000? Is there any predictable estimation I can make, or is this a matter of an internal timeout? I believe that calculating chunks (filters) on a huge dataset could be time/resource consuming.