ReclameAQUI Data Lake
ReclameAQUI (Portuguese for “complain here”) is an interesting and unique business. It aggregates customers' accounts of their shopping experiences, online and offline, with an emphasis on the bad ones. It goes further than a mere “complaints website”, however: it offers an interface for companies to answer complaints and help customers resolve their issues.
The service is the largest of its kind worldwide, receiving 600K unique visitors each day who check a company’s reputation before closing a deal or making a purchase.
Even though the company was already digitally mature, with most services hosted in the cloud and an established analytical culture, its data lake needed some upgrades. The main motivation for this project was the sky-high GCP bills, especially those related to BigQuery data consumption.
Apart from the cost-reduction tasks and the optimization of the data ingestion process, we took the opportunity to implement encryption at rest, governance, and data obfuscation during query execution against the data lake. By making data accessible to everyone in the company while controlling identity and access management through LDAP (and auditing each access, to be fully compliant with GDPR), we could offer a self-service data lake, so that different business actors could satisfy their needs by “drinking” from the lake.
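To make the idea of obfuscation at query time concrete, here is a minimal sketch in Python. It is not the project's implementation and not Apache Ranger's policy engine; the column names, group names, and masking rules are all hypothetical and only illustrate how results could be masked depending on who is asking.

```python
import hashlib

# Hypothetical policy table: column name -> masking rule.
# These names are illustrative only and do not come from the project.
MASKING_POLICY = {
    "email": "hash",
    "phone": "show_last_4",
}

# Hypothetical groups that are allowed to see unmasked values.
UNMASKED_GROUPS = {"data-governance"}


def mask_value(value: str, rule: str) -> str:
    """Apply one masking rule to a single value."""
    if rule == "hash":
        # One-way hash: equal inputs give equal outputs, so values can
        # still be joined or counted, but they are no longer readable.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if rule == "show_last_4":
        # Replace everything except the last four characters.
        return "x" * max(len(value) - 4, 0) + value[-4:]
    raise ValueError(f"unknown masking rule: {rule}")


def apply_masking(row: dict, user_groups: set) -> dict:
    """Return a copy of `row`, masked according to the caller's groups."""
    if user_groups & UNMASKED_GROUPS:
        return dict(row)
    return {
        col: mask_value(str(val), MASKING_POLICY[col]) if col in MASKING_POLICY else val
        for col, val in row.items()
    }
```

Under these assumptions, a caller in an ordinary group would get hashed emails and partially hidden phone numbers, while a member of the hypothetical `data-governance` group would see the row unchanged.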
Key objectives were cost-optimization of the existing Data Lake, improvement (and extension) of existing data ingestion pipelines, and security enhancements.
Starting with the Data Lake’s cost optimization, we redesigned the data ingestion to use a “landing” area for raw data. Transformations are applied later to fit the desired data models, and the results are saved in other Data Lake layers to achieve better query performance.
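The landing-then-transform pattern can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`id`, `company`, `created_at`) and the shape of the curated row are hypothetical, and the real pipeline ran in Apache NiFi rather than in a Python script.

```python
import json
from datetime import datetime, timezone


def to_curated(raw_line: str) -> dict:
    """Turn one raw landing-area record into a row for a curated layer.

    The field names used here are hypothetical.
    """
    raw = json.loads(raw_line)
    return {
        "complaint_id": int(raw["id"]),
        "company": raw["company"].strip().lower(),
        # Normalize epoch seconds to an ISO-8601 UTC timestamp.
        "created_at": datetime.fromtimestamp(
            int(raw["created_at"]), tz=timezone.utc
        ).isoformat(),
    }


def transform_landing(raw_lines):
    """Transform landing records without modifying the raw input.

    Returns (curated_rows, rejected_lines) so that malformed input can be
    inspected later instead of being silently dropped.
    """
    curated, rejected = [], []
    for line in raw_lines:
        try:
            curated.append(to_curated(line))
        except (ValueError, KeyError, TypeError, AttributeError):
            rejected.append(line)
    return curated, rejected
```

Keeping the raw records untouched in the landing area means a transformation can be corrected and replayed later without re-ingesting from the source.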
We moved away from streaming inserts in BigQuery by adding a load step at the end of the ingestion pipeline. Apache NiFi was the main tool responsible for orchestrating and executing the pipeline, and it also supported the data ingestion improvements achieved through process re-engineering.
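The difference between inserting rows one by one and loading them at the end can be illustrated with a small batching sketch. It assumes rows are buffered and written as newline-delimited JSON files, which a BigQuery load job could then pick up (for example via `bq load --source_format=NEWLINE_DELIMITED_JSON`). The class name, file naming, and batch size are hypothetical, and the project itself did this inside NiFi.

```python
import json
import os
import tempfile


class LoadFileBatcher:
    """Buffer rows and flush them to NDJSON files for a later batch load."""

    def __init__(self, out_dir=None, max_rows=1000):
        # Fall back to a fresh temporary directory when none is given.
        self.out_dir = out_dir or tempfile.mkdtemp(prefix="load-batches-")
        self.max_rows = max_rows
        self._buffer = []
        self.files = []  # paths of files that are ready to be loaded

    def add(self, row: dict) -> None:
        """Buffer one row; write a file once the batch is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Write any buffered rows to a new NDJSON file."""
        if not self._buffer:
            return
        path = os.path.join(self.out_dir, f"batch-{len(self.files):05d}.ndjson")
        with open(path, "w", encoding="utf-8") as fh:
            for row in self._buffer:
                fh.write(json.dumps(row) + "\n")
        self.files.append(path)
        self._buffer = []
```

Calling `flush()` once more at the end of the pipeline writes out the last partial batch, so every row ends up in exactly one file.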
Auditing in the Data Lake was managed through Apache Ranger. To have it fully supported, we implemented a JDBC driver using Avatica, a component of Apache Calcite. Authentication for Apache Ranger went through a custom LDAP plugin (also developed during the project) that consumes user information from Google Cloud Identity, reflecting the organization’s existing users and groups from Google Suite.
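As a rough illustration of what reflecting directory users and groups in LDAP can look like, the sketch below maps a list of users and groups to LDAP-style entries. It does not reproduce the custom plugin or call any Google API. The base DN, the input shapes, and the choice of `inetOrgPerson` and `groupOfNames` are assumptions, and it presumes that names contain no characters that need DN escaping.

```python
BASE_DN = "dc=example,dc=com"  # hypothetical base DN


def user_dn(uid: str) -> str:
    """Build the distinguished name of a user entry."""
    return f"uid={uid},ou=users,{BASE_DN}"


def group_dn(name: str) -> str:
    """Build the distinguished name of a group entry."""
    return f"cn={name},ou=groups,{BASE_DN}"


def uid_from_email(email: str) -> str:
    """Use the local part of the email address as the LDAP uid."""
    return email.split("@", 1)[0]


def to_ldap_entries(users, groups) -> dict:
    """Map directory users and groups to LDAP-style entries.

    users:  list of {"email": ..., "name": ...} dicts (hypothetical shape)
    groups: dict of group name -> list of member emails (hypothetical shape)
    """
    entries = {}
    for user in users:
        uid = uid_from_email(user["email"])
        entries[user_dn(uid)] = {
            "objectClass": ["inetOrgPerson"],
            "uid": [uid],
            "cn": [user["name"]],
            "sn": [user["name"].split()[-1]],  # surname: last word of the name
            "mail": [user["email"]],
        }
    for name, member_emails in groups.items():
        entries[group_dn(name)] = {
            "objectClass": ["groupOfNames"],
            "cn": [name],
            "member": [user_dn(uid_from_email(e)) for e in member_emails],
        }
    return entries
```

With entries of this kind available over LDAP, access policies can be written against groups rather than individual users, so directory changes carry over without editing each policy.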
To make things more interesting, we containerized the workflow and relied heavily on Kubernetes (GKE) to manage these components. Most of the Apache projects didn’t have Helm charts at the time, so we developed our own and open-sourced some of them.
Impact and results
During the project we measured an estimated cost reduction of roughly 56% for the Data Lake, achieved by re-engineering processes and resources, especially by removing streaming inserts to BigQuery.
We also made significant progress in security and governance with the introduction of Apache Ranger and Data Lake auditing of access and usage. This gave ReclameAQUI advanced security capabilities and put it ahead of GDPR and data privacy requirements.