[{"data":1,"prerenderedAt":100},["ShallowReactive",2],{"\u002Fen\u002Fprojects\u002F2019\u002Freclameaqui-data-lake":3},{"id":4,"title":5,"body":6,"createdAt":77,"description":78,"extension":79,"meta":80,"navigation":90,"path":91,"seo":92,"slug":93,"stem":94,"tags":95,"website":98,"__hash__":99},"projects\u002Fprojects\u002F2019\u002Freclameaqui-data-lake.md","ReclameAQUI Data Lake",{"type":7,"value":8,"toc":69},"minimark",[9,14,19,23,26,30,33,36,40,44,47,50,53,56,59,63,66],[10,11,13],"h1",{"id":12},"introduction","Introduction",[15,16,18],"h2",{"id":17},"summary","Summary",[20,21,22],"p",{},"ReclameAQUI (Portuguese for \"complain here\") is an interesting and unique\nbusiness. It is a content aggregator where customers share their shopping\nexperiences (especially bad ones), both online and offline. However, it\ngoes further than a mere \"complaints website\" by offering an interface for\ncompanies to answer complaints, helping customers with their issues.",[20,24,25],{},"The service is the biggest of its kind worldwide, receiving 600K\nunique visitors each day who check a company's reputation before closing a\ndeal\u002Fpurchase.",[15,27,29],{"id":28},"problem","Problem",[20,31,32],{},"Even though they are already advanced in the digital approach to business,\nwith most services hosted in the cloud and an analytical culture in place, their\ndata lake needed some upgrades. The main motivator for this project was\nthe sky-high bills from GCP, especially those related to BigQuery data consumption.",[20,34,35],{},"Apart from the cost-reduction tasks and data ingestion process optimization, we\ntook the opportunity to implement data encryption at rest, governance, and\nobfuscation during query executions against the data lake. 
By making data\naccessible to everyone in the company while controlling identity and access\nmanagement through LDAP (auditing each access, to be fully compliant with\nGDPR), we could offer a self-service data lake so different business actors\ncould satisfy their needs by \"drinking\" from the lake.",[10,37,39],{"id":38},"solution","Solution",[15,41,43],{"id":42},"tech-implementation","Tech implementation",[20,45,46],{},"The key objectives were cost optimization of the existing Data Lake, improvement\n(and extension) of the existing data ingestion pipelines, and security enhancements.",[20,48,49],{},"Starting with the Data Lake's cost optimization, we redesigned the data ingestion\nto use a \"landing\" area for raw data, applying transformations later to suit\nthe desired data models and saving the results in other Data Lake layers to achieve\nbetter query performance.",[20,51,52],{},"We moved away from streaming inserts in BigQuery by adding a step that loads\ndata at the end of the ingestion pipeline. Apache NiFi was the main software\nresponsible for orchestrating and executing the pipeline, and it also covered the\nimprovements in data ingestion achieved through process re-engineering.",[20,54,55],{},"Auditing in the Data Lake was managed through Apache Ranger. To have\nit fully supported, we implemented a JDBC driver using a component from Apache\nCalcite called Avatica. Authentication for Apache Ranger went through a custom\nLDAP plugin (also developed during the project) that consumed user info from\nGoogle Cloud Identity, reflecting the organization's existing users and groups\nfrom Google Suite.",[20,57,58],{},"To make the game more interesting, we containerized the workflow and made heavy\nuse of Kubernetes (GKE) to manage these components. 
Most of the Apache projects\ndidn't have Helm Charts at the time, so we developed our own and open-sourced some\nof them.",[15,60,62],{"id":61},"impact-and-results","Impact and results",[20,64,65],{},"During the project we measured estimated Data Lake cost savings of roughly 56%,\nachieved through the re-engineering of processes and resources, especially\nthe removal of streaming inserts to BigQuery.",[20,67,68],{},"We made relevant progress in security and governance during the project with the\nintroduction of Apache Ranger and Data Lake auditing for access and usage,\nproviding advanced security capabilities to ReclameAQUI, which put it ahead\nof GDPR and data privacy concerns.",{"title":70,"searchDepth":71,"depth":71,"links":72},"",2,[73,74,75,76],{"id":17,"depth":71,"text":18},{"id":28,"depth":71,"text":29},{"id":42,"depth":71,"text":43},{"id":61,"depth":71,"text":62},"2019-10-02T00:00:00","Containerized Data Lake running on GCP, using Kubernetes (GKE) to orchestrate Apache ecosystem components, with GCS for data storage and BigQuery as the analytical interface.\nGovernance and security fully implemented using existing Google Suite groups and users through LDAP, giving stakeholders full autonomy to consume data from the Lake (with auditing).","md",{"duration":81,"tools":84},{"from":82,"to":83},"2019-05-01T00:00:00","2019-09-30T00:00:00",[85,86,87,88,89],"apache spark","kubernetes","python","google bigquery","apache nifi",true,"\u002Fprojects\u002F2019\u002Freclameaqui-data-lake",{"title":5,"description":78},"reclameaqui-data-lake","projects\u002F2019\u002Freclameaqui-data-lake",[96,97],"cloud-native","data lake",null,"QKyuci8jk1a_mXWiDZPW8IrSycgLC-_Ho1g0ydP1aL8",1778441743932]