Tips for Implementing Talend Big Data Governance


In this series we will cover five tips for turning your Big Data initiatives into tangible results with Data Governance and Metadata Management. The series explains how each of the main components of Talend Data Fabric can address the challenges of governing data with Talend. We call them the five pillars of metadata management with Talend.

Metadata Management with Talend

Without metadata, there is no way to create a holistic and actionable view of the information supply chain. This view is a prerequisite not only for managing change and providing auditability and traceability of data flows, but also for increasing data accessibility through easy-to-use mechanisms such as search or visual maps. Although metadata can be reverse-engineered in some cases, it is much easier to collect, process, maintain and track metadata at its source, as soon as it is created.

Streamline Your Metadata across Your Data Platforms with Talend Metadata Bridge

Talend Metadata Bridge allows developers to import and export metadata from Talend Studio (and similarly from Talend Metadata Manager), as well as to access metadata from virtually any data platform. With more than 100 connectors, Talend Metadata Bridge helps harvest metadata from modelling tools such as Erwin or Embarcadero; ETL tools such as Informatica or IBM DataStage; SQL and NoSQL databases; Hadoop; popular BI and data discovery tools such as Tableau, Qlik or BusinessObjects; and XML or COBOL structures.

The bridges allow developers to design data structures once and propagate them repeatedly across various tools and platforms. This makes it easier to enforce standards, propagate changes, and facilitate migrations, since data formats can be translated from virtually any third-party tool or platform to Talend. For example, you can import an Oracle table into Talend and then propagate it to another third-party platform such as Redshift. Talend Big Data can also offload a traditional ETL job into a native Hadoop process.
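To make the "design once, propagate everywhere" idea concrete, here is a minimal Python sketch. It is not Talend Metadata Bridge code and does not use any Talend API; the `Column` class, the `TYPE_MAP` table and the `render_ddl` function are hypothetical names introduced for illustration, and the type mapping covers only three logical types under simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative mapping only (an assumption for this sketch): a real
# metadata bridge handles far more types, constraints and edge cases.
TYPE_MAP = {
    "oracle": {
        "string": "VARCHAR2({length})",
        "decimal": "NUMBER({precision},{scale})",
        "timestamp": "TIMESTAMP",
    },
    "redshift": {
        "string": "VARCHAR({length})",
        "decimal": "DECIMAL({precision},{scale})",
        "timestamp": "TIMESTAMP",
    },
}


@dataclass(frozen=True)
class Column:
    """Platform-neutral column definition, designed once."""
    name: str
    logical_type: str  # one of the keys used in TYPE_MAP
    length: Optional[int] = None
    precision: Optional[int] = None
    scale: Optional[int] = None


def render_ddl(table: str, columns: List[Column], platform: str) -> str:
    """Render the same logical structure as DDL for a given platform."""
    templates = TYPE_MAP[platform]
    lines = []
    for col in columns:
        sql_type = templates[col.logical_type].format(
            length=col.length, precision=col.precision, scale=col.scale
        )
        lines.append(f"  {col.name} {sql_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


# Hypothetical table used only to exercise the sketch.
customers = [
    Column("customer_id", "decimal", precision=10, scale=0),
    Column("email", "string", length=255),
    Column("created_at", "timestamp"),
]

print(render_ddl("customers", customers, "oracle"))
print(render_ddl("customers", customers, "redshift"))
```

The point of the design is that the structure lives in one neutral definition, so a change to a column is made once and re-rendered for every target rather than edited separately in each tool.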

Handle Hadoop Governance Challenges with Talend Big Data

By design, Hadoop accelerates data proliferation. Unlike traditional databases, which provide a single point of reference for data, data manipulations, and their related metadata, Hadoop combines multiple storage and data processing options. In addition, as part of its high availability strategy, Hadoop tends to replicate data across many nodes and to create intermediary copies of raw data between processing steps. All of these factors pose significant threats to data governance, and data lineage therefore becomes critical to provide traceability and auditability of data flows inside Hadoop.

But the beauty of Hadoop is that it is an open, extensible, community-based framework. Its weaknesses trigger innovation projects that address the issues and turn them into strengths. Apache Atlas and Cloudera Navigator are the most common Hadoop extensions for addressing the specific challenges of data governance within Hadoop.

Talend Big Data integrates with Cloudera Navigator and Apache Atlas (for Hortonworks) and exposes the detailed metadata for its data flows to each of these third-party data governance environments. Through this capability, Talend enriches those environments with data lineage that goes into much greater depth than would be available if the data flows were hand-coded directly in Hadoop or Spark. Thanks to Cloudera Navigator and Apache Atlas, the metadata generated by Talend can be connected to other data points, searched, visualized as data lineage maps, and shared with potentially any authorized user in the Hadoop environment, not just Talend developers and administrators. These tools also make metadata more actionable by triggering actions (such as auto-classification of metadata or the definition of retention policies) for specific datasets, either on arrival or at scheduled intervals.

As an example, Talend was the first vendor to deliver field-level data lineage for Spark in Cloudera Navigator, a critical capability for big data use cases in heavily regulated environments such as financial services or life sciences.
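Field-level lineage can be pictured as a directed graph in which each field points back to the fields it was derived from. The following Python sketch illustrates that idea only; it does not use the Cloudera Navigator or Apache Atlas APIs, and the field names, the `LINEAGE` dictionary and both functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each key is a target field, and the value lists
# the fields it was directly derived from.
LINEAGE = {
    "report.revenue_eur": ["staging.orders.amount", "staging.fx_rates.rate"],
    "staging.orders.amount": ["raw.orders.amount"],
    "staging.fx_rates.rate": ["raw.fx_feed.rate"],
}


def upstream_fields(field, lineage):
    """Return every field that `field` depends on, directly or indirectly."""
    seen = set()
    queue = deque(lineage.get(field, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


def root_sources(field, lineage):
    """Return only the original source fields (those with no recorded parents)."""
    return {f for f in upstream_fields(field, lineage) if f not in lineage}


print(sorted(upstream_fields("report.revenue_eur", LINEAGE)))
print(sorted(root_sources("report.revenue_eur", LINEAGE)))
```

An auditor asking "where does this reported figure come from?" is effectively asking for `root_sources`, which is why lineage recorded at the field level is more useful in regulated settings than lineage recorded only per table or per job.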

Build a Data Lake with Enhanced Data Accessibility

Until now, business users may have perceived data governance as an administrative constraint rather than a value-add, but in truth it can bring many benefits. For example, would you eat food from a retail store without first reading the label and checking that it was properly packaged? Knowing the name, origin, ingredients, weight and quantity, nutrition facts, and so on is crucial before consuming any food item. The same principles should apply to data.

Talend provides a Business Glossary in Talend Metadata Manager that allows data stewards to maintain the business definitions for all data, link those definitions to the tools and environments where the data can be accessed (such as Hive tables in Hadoop or Tableau dashboards), and finally expose them to business users. Similarly, Talend Data Preparation provides its own dataset inventory so that anyone can access, cleanse and shape data on a self-service basis. Because self-service is a key part of Talend's market vision, stay tuned for more innovations in this area.
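The structure behind a business glossary can be sketched in a few lines of Python: a term carries a business definition and links to the physical assets where the data lives. This is an illustration of the concept only, not the Talend Metadata Manager data model; the `GlossaryTerm` class, the `find_terms` function and all asset names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryTerm:
    """A business definition linked to the places the data can be accessed."""
    name: str
    definition: str
    steward: str
    assets: List[str] = field(default_factory=list)


def find_terms(terms, keyword):
    """Case-insensitive search over term names and definitions."""
    needle = keyword.lower()
    return [
        t for t in terms
        if needle in t.name.lower() or needle in t.definition.lower()
    ]


# Hypothetical glossary entries and asset identifiers.
glossary = [
    GlossaryTerm(
        name="Active Customer",
        definition="A customer with at least one order in the last 12 months.",
        steward="sales-data-stewards",
        assets=["hive://sales.active_customers", "tableau://dashboards/retention"],
    ),
    GlossaryTerm(
        name="Net Revenue",
        definition="Gross revenue minus returns and discounts.",
        steward="finance-data-stewards",
        assets=["hive://finance.net_revenue_daily"],
    ),
]

for term in find_terms(glossary, "customer"):
    print(term.name, "->", ", ".join(term.assets))
```

A business user searching for "customer" gets both the agreed definition and the locations to query, which is the "label on the package" described above.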

Oversee and Monitor Data Flows beyond Hadoop with Talend Metadata Manager

Gone are the days when it seemed feasible to manage every data source in one place. Legacy systems are here to stay; enterprise applications from vendors such as Microsoft, SAP and Oracle will continue to run core business processes; cloud applications will continue to proliferate; and traditional data warehouses and departmental BI will coexist with more modern data platforms like Hadoop for some time.

Not only does this increase the need for environments such as Talend Data Fabric to manage the data flows across those systems, but it also drives the need for a platform that provides a holistic view of the information supply chain, wherever data resides. Organizations operating in heavily regulated environments go so far as to mandate these capabilities for their audit trails.