Challenges with data lakes

Nitish Kumar
1 min read · Jan 2, 2022
Data Lake

A data lake comes with several challenges. Some of them are:

1. Hard to append data

Adding newly arrived data while queries are running can lead to incorrect reads.

2. Modification of existing data is difficult

Regulations such as GDPR and CCPA require making fine-grained changes to data that already exists in the data lake.
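To see why a fine-grained change is expensive, here is a minimal sketch that uses a local folder of JSON-lines files as a stand-in for the lake. The `delete_user` helper, the `user_id` field, and the file layout are assumptions made for illustration, not any particular engine's API. Removing one user means rewriting every file that contains that user.

```python
import json
import tempfile
from pathlib import Path


def delete_user(table_dir: Path, user_id: int) -> int:
    """Remove one user's rows; return how many files had to be rewritten."""
    rewritten = 0
    for path in sorted(table_dir.glob("*.jsonl")):
        rows = [json.loads(line) for line in path.read_text().splitlines()]
        kept = [row for row in rows if row["user_id"] != user_id]
        if len(kept) != len(rows):
            # Files are immutable blobs here: a single-row change rewrites the whole file.
            path.write_text("".join(json.dumps(row) + "\n" for row in kept))
            rewritten += 1
    return rewritten


# Hypothetical table: 5 files, each holding rows for users 0..3.
table_dir = Path(tempfile.mkdtemp())
for file_no in range(5):
    rows = [{"user_id": uid, "file": file_no} for uid in range(4)]
    (table_dir / f"part-{file_no:05d}.jsonl").write_text(
        "".join(json.dumps(row) + "\n" for row in rows)
    )

files_rewritten = delete_user(table_dir, user_id=2)
print(files_rewritten)  # every file contained user 2, so all 5 are rewritten
```

In this toy setup one user's data is spread across all files, so a single delete request touches the whole table.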

3. Jobs failing midway

When a job fails midway, part of the data appears in the data lake and the rest is missing.
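A small sketch of how a partial write becomes visible, again with a local folder standing in for the lake. The `write_batch` function and its `fail_after` parameter are hypothetical and exist only to simulate a crash partway through a job.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_batch(table_dir: Path, records: list, fail_after: Optional[int] = None) -> None:
    """Write one file per record straight into the table directory."""
    table_dir.mkdir(parents=True, exist_ok=True)
    for i, record in enumerate(records):
        if fail_after is not None and i == fail_after:
            # Hypothetical knob: simulate the job dying partway through.
            raise RuntimeError("simulated job failure")
        (table_dir / f"part-{i:05d}.json").write_text(json.dumps(record))


def visible_rows(table_dir: Path) -> int:
    """Count what a reader listing the directory would see."""
    return len(list(table_dir.glob("part-*.json")))


records = [{"id": n} for n in range(10)]
table_dir = Path(tempfile.mkdtemp()) / "events"

try:
    write_batch(table_dir, records, fail_after=5)
except RuntimeError:
    pass  # the job is gone, but the files it already wrote are not

partial_rows = visible_rows(table_dir)
print(partial_rows)  # 5 of the 10 records are visible to readers
```

Nothing marks those five files as belonging to a failed job, so a reader has no way to tell the table is incomplete.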

4. Real-time operations

Mixing streaming and batch workloads leads to inconsistency.

5. Costly to keep historical versions of the data

Regulated environments require reproducibility, auditing, and governance, all of which depend on keeping historical versions of the data.

6. Difficult to handle large metadata

For large data lakes, the metadata itself becomes difficult to manage.

7. Too many files

Data lakes are not great at handling millions of small files.
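One common way to deal with small files is to compact them into fewer, larger ones. The sketch below shows the idea on local JSON-lines files. The `compact` function and the `lines_per_file` threshold are illustrative assumptions, not a specific tool's behaviour.

```python
import tempfile
from pathlib import Path


def compact(src_dir: Path, dst_dir: Path, lines_per_file: int) -> int:
    """Merge small line-delimited files into larger ones; return files written."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    buffer, written = [], 0
    for small in sorted(src_dir.glob("*.jsonl")):
        buffer.extend(small.read_text().splitlines())
        # Flush a full output file whenever enough lines have accumulated.
        while len(buffer) >= lines_per_file:
            chunk, buffer = buffer[:lines_per_file], buffer[lines_per_file:]
            (dst_dir / f"compacted-{written:05d}.jsonl").write_text("\n".join(chunk) + "\n")
            written += 1
    if buffer:
        # Write whatever is left over as a final, smaller file.
        (dst_dir / f"compacted-{written:05d}.jsonl").write_text("\n".join(buffer) + "\n")
        written += 1
    return written


# Hypothetical input: 1000 files with a single record each.
root = Path(tempfile.mkdtemp())
src, dst = root / "raw", root / "compacted"
src.mkdir()
for n in range(1000):
    (src / f"part-{n:05d}.jsonl").write_text(f'{{"id": {n}}}\n')

files_written = compact(src, dst, lines_per_file=250)
print(files_written)  # 1000 small files become 4 larger ones
```

The data is unchanged; only the number of files a reader has to list and open goes down.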

8. Hard to get great performance

Partitioning the data for performance is error-prone and difficult to change.
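A rough illustration of both halves of that sentence, assuming the common `key=value` directory layout for partitions. The `partition_path` helper and the sample records are made up for this example.

```python
from collections import Counter


def partition_path(record: dict, keys: list) -> str:
    """Build a key=value style directory path from the chosen partition keys."""
    return "/".join(f"{key}={record[key]}" for key in keys)


records = [
    {"date": "2022-01-01", "user_id": 1},
    {"date": "2022-01-01", "user_id": 2},
    {"date": "2022-01-01", "user_id": 3},
    {"date": "2022-01-02", "user_id": 4},
    {"date": "2022-01-02", "user_id": 5},
    {"date": "2022-01-02", "user_id": 6},
]

# Error-prone: a high-cardinality key gives one tiny partition per record.
by_date = Counter(partition_path(r, ["date"]) for r in records)
by_user = Counter(partition_path(r, ["user_id"]) for r in records)
print(len(by_date), len(by_user))  # 2 partitions versus 6

# Difficult to change: a new partition scheme moves every record to a new path.
moved = sum(
    partition_path(r, ["date"]) != partition_path(r, ["date", "user_id"])
    for r in records
)
print(moved)  # all 6 records would have to be rewritten
```

Because the partition key is baked into the storage path, changing it later means rewriting the data rather than updating a setting.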

9. Data quality issues

It's a constant headache to ensure that all the data is correct and of high quality.
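As a sketch of what checking quality on the way in can look like, the snippet below validates records before they are accepted. The `validate` function, the field names, and the rules are assumptions chosen for illustration; real pipelines would define their own.

```python
def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty means it passed)."""
    errors = []
    if not isinstance(record.get("id"), int):
        errors.append("id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    return errors


incoming = [
    {"id": 1, "amount": 10.0},
    {"id": "2", "amount": 5.0},   # wrong type for id
    {"id": 3, "amount": -1},      # negative amount
    {"id": 4},                    # missing amount
]

accepted = [r for r in incoming if not validate(r)]
rejected = [r for r in incoming if validate(r)]
print(len(accepted), len(rejected))  # 1 accepted, 3 rejected
```

Without a check like this at write time, bad records land in the lake silently and every downstream reader has to deal with them.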

Note: If you notice any errors in my understanding, feel free to reach out and let me know, and I will update the blog post accordingly.
