[{"data":1,"prerenderedAt":108},["ShallowReactive",2],{"\u002Fen\u002Fprojects\u002F2018\u002Fclipping-service-news-ocr":3},{"id":4,"title":5,"body":6,"createdAt":83,"description":84,"extension":85,"meta":86,"navigation":97,"path":98,"seo":99,"slug":100,"stem":101,"tags":102,"website":106,"__hash__":107},"projects\u002Fprojects\u002F2018\u002Fclipping-service-news-ocr.md","Clipping Service News OCR",{"type":7,"value":8,"toc":75},"minimark",[9,14,19,23,26,29,33,36,39,43,47,50,53,56,59,62,65,69,72],[10,11,13],"h1",{"id":12},"introduction","Introduction",[15,16,18],"h2",{"id":17},"summary","Summary",[20,21,22],"p",{},"As the owner of the biggest Brazilian media data set, the media monitoring\nleader Clipping Service was having issues with scalability and was approaching\nits data center's maximum capacity.",[20,24,25],{},"Clipping Service operates on a huge scale, receiving around 4.5K media press\npages per day from roughly 300 newspapers in both digital and printed versions.\nPreviously, employees called \"readers\" were responsible for reading and clipping\n(highlighting the targeted content), which was later passed on to the\n\"reviewers\" team.",[20,27,28],{},"As if the burden of reading countless pages a day were not enough, the readers'\noperation starts around 4:30 a.m., when the \"first reading\" begins (i.e., the\ndelivery of the morning papers).",[15,30,32],{"id":31},"problem","Problem",[20,34,35],{},"For over 20 years this content has been ingested by so-called \"readers\". But with\nthe advent of the internet and the digital press boom at the end of the 1990s,\nand more recently of social media, companies are shifting their clipping\ninvestments to monitoring other areas. 
This required Clipping Service\nto act in order to remain competitive in the market.",[20,37,38],{},"By automating news reading with OCR, NLP, and artificial intelligence to\ncategorize media, the plan was to achieve a higher throughput during ingestion,\ngiving readers more free time to review the content. This would in turn raise\nthe quality of the content, since we as humans aren't good at doing\nrepetitive tasks, especially when it comes to reading endless pages searching\nfor names and words.",[10,40,42],{"id":41},"solution","Solution",[15,44,46],{"id":45},"tech-implementation","Tech implementation",[20,48,49],{},"After spending some time researching and benchmarking the alternatives at hand,\nwe decided to use Python as the implementation language for handling text,\nOCR, and NLP (using NLTK), given its extensive APIs and libraries for NLP and\nimage processing.",[20,51,52],{},"As the cloud provider we chose AWS for its stability and consistency over\nother vendors. The conclusion at the time was that the AWS price estimate was\n14.67% higher than GCP's. However, AWS was more popular than GCP and proven in\nterms of stability, support, and integrity, making it a safer choice for a\nslightly higher price.",[20,54,55],{},"The tech stack was Python 3 using Dramatiq as the task processing library,\nrunning Tesseract OCR jobs, processing text with NLTK and images with Pillow\n(the maintained fork of PIL). Redis was the message broker for Dramatiq, a simple\nPostgres database stored execution metrics, and Elasticsearch\nstored the processed content.",[20,57,58],{},"Requests coming from the data center reached an API Gateway, which invoked\na Lambda function and delivered the resulting content.",[20,60,61],{},"The best part of the design? We stored and served the content via AWS S3. 
Each\npart was designed for fault tolerance, and we simply turned off the entire\ncloud infrastructure after the daily operation, turning it back on only the next day.",[20,63,64],{},"Operating only from 4 a.m. to 2 p.m., the project was \"serverless\" and ephemeral, benefiting\nfrom an aggressive cloud cost reduction.",[15,66,68],{"id":67},"impact-and-results","Impact and results",[20,70,71],{},"Clipping Service reduced its reading team workforce by ~78%, offering internal\nhiring for other areas of the company and a voluntary dismissal plan with\nbenefits, making the process as humane as possible for the former employees.",[20,73,74],{},"By automating reading tasks, Clipping Service achieved considerable\nimprovements in media press ingestion throughput (around 20 times faster)\nand offered higher-quality press clipping to its customers. It later saw the\nopportunity to create a self-service press clipping service, since the\noperational cost had decreased significantly.",{"title":76,"searchDepth":77,"depth":77,"links":78},"",2,[79,80,81,82],{"id":17,"depth":77,"text":18},{"id":31,"depth":77,"text":32},{"id":45,"depth":77,"text":46},{"id":67,"depth":77,"text":68},"2018-09-11T00:00:00","Media monitoring and news clipping service automation through artificial intelligence, OCR and NLP. Delivering a higher throughput to the operation and creating a true serverless infrastructure to extend its Data Center capabilities.","md",{"duration":87,"tools":90},{"from":88,"to":89},"2018-03-31T00:00:00","2018-09-12T00:00:00",[91,92,93,94,95,96],"tesseract ocr","python","dramatiq","aws","aws ecs","noops",true,"\u002Fprojects\u002F2018\u002Fclipping-service-news-ocr",{"title":5,"description":84},"clipping-service-news-ocr","projects\u002F2018\u002Fclipping-service-news-ocr",[103,104,105],"cloud-native","serverless","nlp",null,"0RGWaWssukzcQPvJ5JqnWqzV2x7lSEMHVrMQ83xlGbE",1778441743997]