Unlocking The Power Of Unconnected Lookup In Talend
Challenge
Sometimes lookup datasets are so large that they do not fit in memory, and there is no good built-in way to stage the data for these heavy lookups and reuse it multiple times within the same job and its child jobs.
Solution
To overcome this issue, the BigData Dimension team has created components that add Unconnected Lookup functionality to Talend. They make it possible to perform lookups without the tMap or tJoin components and offer the flexibility of loading the data once and using it in multiple jobs. If the dataset doesn’t fit in memory, we can select disk-based storage for the data and also use that stored data in child jobs.
Output
This allows us to create and populate HashMaps. The "Append existing" option is shared between the Basic and Advanced settings. When creating a new hash, we can choose whether to store the data in memory or on disk.
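Conceptually, the component loads the lookup data into a map once and publishes it under a name that later lookups can resolve. The minimal Java sketch below illustrates this idea; the names (EMP_DB, LookupHashSketch) and the use of globalMap as the registry are assumptions for illustration, not the component’s actual internals.

    import java.util.HashMap;
    import java.util.Map;

    public class LookupHashSketch {
        public static void main(String[] args) {
            // In a real Talend job, globalMap is provided by the generated code
            Map<String, Object> globalMap = new HashMap<>();

            // Load the lookup data once, keyed by the configured key column
            Map<String, String[]> empLookup = new HashMap<>();
            empLookup.put("E001", new String[] { "Alice", "LOC1" }); // emp_name, emp_loc
            empLookup.put("E002", new String[] { "Bob", "LOC2" });

            // Publish the hash under a well-known name so later components
            // (and child jobs) can reuse it without re-reading the source
            globalMap.put("EMP_DB", empLookup);

            @SuppressWarnings("unchecked")
            Map<String, String[]> emp = (Map<String, String[]>) globalMap.get("EMP_DB");
            System.out.println(emp.get("E001")[0]); // prints Alice
        }
    }

A disk-based variant would back the same map interface with a file-based store instead of an in-memory HashMap; the lookup code itself stays the same.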
Basic settings
By default, we use the keys defined in the schema. This can be overridden by checking the "Manually Provide Key Column name" checkbox. When checked, we can provide one or more key columns; a composite key is given as a comma-separated list of columns (see the sketch at the end of this subsection).
The values work the same way: we can define one or more columns separated by commas.
!!! Keep in mind that you can’t use context or other dynamic variables here, because this information is required when the Java code is generated. !!!
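To illustrate the composite-key case, here is a minimal sketch of how the comma-separated key columns might be combined into a single hash key; the buildKey helper and the "|" separator are illustrative assumptions, not the component’s actual code.

    import java.util.StringJoiner;

    public class CompositeKeySketch {
        // Join the configured key columns' values into one hash key
        static String buildKey(Object... keyColumnValues) {
            StringJoiner key = new StringJoiner("|"); // assumed separator
            for (Object v : keyColumnValues) {
                key.add(String.valueOf(v));
            }
            return key.toString();
        }

        public static void main(String[] args) {
            // e.g. key columns configured as "emp_id,emp_site"
            System.out.println(buildKey(42, "NYC")); // prints 42|NYC
        }
    }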
Advanced settings
These settings are used when the input schema type is Dynamic. The key and value fields work similarly to the ones in the Basic settings, but their values are evaluated at runtime, so globalMap and context variables can be used here.
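For example, with a dynamic schema the key and value fields could be driven by runtime variables; the variable names below are hypothetical:

    // Possible Advanced-settings field values, resolved at runtime:
    // Key:   context.lookupKeyColumn                -> e.g. "emp_id"
    // Value: (String) globalMap.get("valueColumns") -> e.g. "emp_name,emp_loc"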
Map
Basic settings
To be used with regular schemas. Basic mapping contains the 1:1 column mappings; Lookup mapping contains the lookups for the new columns.
The table column names are self-explanatory. Input Key can contain multiple columns separated by commas. Value column must contain a valid column name defined in the lookup. Join Mode: in the case of an INNER join, if the lookup fails the row is skipped (thrown out), as the sketch below shows.
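The following sketch shows the per-row lookup resolution, including the INNER join rule. All names are illustrative; the real component generates equivalent code.

    import java.util.HashMap;
    import java.util.Map;

    public class JoinModeSketch {
        public static void main(String[] args) {
            Map<String, String> locLookup = new HashMap<>();
            locLookup.put("LOC1", "New York");

            String[] incomingKeys = { "LOC1", "LOC9" };
            boolean innerJoin = true;

            for (String key : incomingKeys) {
                String value = locLookup.get(key);
                if (value == null && innerJoin) {
                    continue; // INNER join: lookup failed, row is thrown out
                }
                System.out.println(key + " -> " + value); // an outer join would emit null
            }
        }
    }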
Advanced settings
To be used with dynamic schemas. "Columns to be removed" is a comma-separated list of the columns we’d like to remove from the dynamic row. "Columns to lookup" works similarly to its Basic-settings counterpart, but here the new columns’ types can’t be defined in the schema, so they need to be defined in this table; the Talend Type and Db Type columns are used for that.
Column calculations work the same way: we add a new column, then populate its value. The input column can be referenced as input_row.column_name, which at runtime is replaced with the actual value, as sketched below.
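A minimal sketch of that substitution, assuming a simple token replacement (the expression and column name are made up for illustration):

    public class CalcSketch {
        public static void main(String[] args) {
            // Expression as typed in the column-calculation field
            String expr = "UPPER(input_row.emp_name)";

            // The current row's emp_name value
            String value = "Alice";

            // At runtime the input_row.<column> token is replaced with the value
            String resolved = expr.replace("input_row.emp_name", "\"" + value + "\"");
            System.out.println(resolved); // prints UPPER("Alice")
        }
    }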
How to Use Examples
Basic Map usage
Inputs
This section shows how we can use the map with predefined schemas.
We’re going to have two lookups and will perform 1 + 2 + 1 lookups against them. The two lookups will look like this:
1.) EMP: Lookup 1
Lookup 1 Basic settings
2.) LOC: Lookup 2
Lookup 2 Basic Settings
3.) Job Overview
4.) Input: Main input
What we can do with this input is a lookup based on the employee: we can look up emp_name and emp_loc for it.
Execution
5.) Basic Settings
Upon running, this yields the following:
We can do another lookup based on the employee_site column.
6.) Basic Settings
Which results in:
Multiple Lookups
We can use multiple lookups and multiple values from a lookup. A lookup can be used as many times as required.
We can use different keys for the same lookup:
8.) Basic Settings – Lookup Mapping
Please note that we get the Object map from the lookup once, and then we can look it up as many times as required.
Second map:
Here we use the same map, but with different keys: we want to know the location name for both the sales location and the employee’s base location, which requires two lookups, as the sketch below illustrates.
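In sketch form: the Object map is fetched from globalMap once, then probed with both keys. The LOC_DB name and the data are illustrative assumptions.

    import java.util.HashMap;
    import java.util.Map;

    public class MultiLookupSketch {
        public static void main(String[] args) {
            Map<String, Object> globalMap = new HashMap<>(); // provided by Talend
            Map<String, String> locDb = new HashMap<>();
            locDb.put("LOC1", "New York");
            locDb.put("LOC2", "London");
            globalMap.put("LOC_DB", locDb);

            // Get the Object map once...
            @SuppressWarnings("unchecked")
            Map<String, String> loc = (Map<String, String>) globalMap.get("LOC_DB");

            // ...then reuse it for as many lookups as required
            String salesLocName = loc.get("LOC2"); // key: sales location
            String baseLocName  = loc.get("LOC1"); // key: employee base location
            System.out.println(salesLocName + " / " + baseLocName);
        }
    }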
Advanced Usage
Using Dynamic Schema
The main challenge is modifying the dynamic schema. Talend is very sensitive to this, so it needs to be done carefully, or we’ll get different results than expected. Most of the functions around dynamic schemas swallow errors silently and give no feedback on whether they executed successfully.
Using a lookup populated in a child job
In this case, the parent job will look like this:
11.) Overall look
12.) tRunJob settings
Use CTRL + SPACE to bring up the _DB after variable. The lookup type is Object!
Then we place a single output component in every child job where we want to use the parent’s lookup:
13.) Advanced settings
No other parameters have to be added. Then it can be used as if it had been populated in that job.
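The handover can be pictured like the sketch below: the parent publishes the lookup Object, the tRunJob-style parameter forwards it, and the child’s output component republishes it under the same name. The EMP_DB name and the plumbing are assumptions for illustration, not the component API.

    import java.util.HashMap;
    import java.util.Map;

    public class ParentChildSketch {
        static void childJob(Map<String, Object> childGlobalMap) {
            // The child's output component republishes the parent's map, so
            // lookups behave exactly as if it had been populated locally
            @SuppressWarnings("unchecked")
            Map<String, String> emp = (Map<String, String>) childGlobalMap.get("EMP_DB");
            System.out.println(emp.get("E001")); // prints Alice
        }

        public static void main(String[] args) {
            Map<String, Object> parentGlobalMap = new HashMap<>();
            Map<String, String> empDb = new HashMap<>();
            empDb.put("E001", "Alice");
            parentGlobalMap.put("EMP_DB", empDb);

            // tRunJob-style handover: the Object is forwarded to the child job
            Map<String, Object> childGlobalMap = new HashMap<>();
            childGlobalMap.put("EMP_DB", parentGlobalMap.get("EMP_DB"));
            childJob(childGlobalMap);
        }
    }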
ABOUT BIG DATA DIMENSION
BigData Dimension is a leading provider of cloud and on-premise solutions for BigData Lake Analytics, Cloud Data Lake Analytics, Talend Custom Solution, Data Replication, Data Quality, Master Data Management (MDM), Business Analytics, and custom mobile, application, and web solutions. BigData Dimension equips organizations with cutting-edge technology and analytics capabilities, all integrated by our market-leading professionals. Through our Data Analytics expertise, we enable our customers to see the right information to make the decisions they need to make on a daily basis. We excel in out-of-the-box thinking to answer your toughest business challenges.
TALEND CUSTOM SOLUTION MADE FOR YOU
You’ve already invested in a Talend project, or perhaps you already have a Talend solution implemented but aren’t utilizing its full power. To get the full value of the product, you need the solution implemented by industry experts.
At BigData Dimension, we have experience spanning over a decade integrating technologies around Data Analytics. As far as Talend goes, we’re one of the few best-of-breed Talend-focused systems integrators in the entire world. So when it comes to your Talend deployment and getting the most out of it, we’re here for you with unmatched expertise.
Our work covers many different industries including Healthcare, Travel, Education, Telecommunications, Retail, Finance, and Human Resources.
We offer flexible delivery models to meet your needs and budget, including onshore and offshore resources. We can deploy and scale our talented experts within two weeks.
GETTING STARTED
- Full requirements analysis of your infrastructure
- Implementation, deployment, training, and ongoing services, both cloud-based and on-premise
MEETING YOUR VARIOUS NEEDS
- BigData Management by Talend: Leverage Talend Big Data and its built-in extensions for NoSQL, Hadoop, and MapReduce. This can be done either on-premise or in the cloud to meet your requirements around Data Quality, Data Integration, and Data Mastery
- Cloud Integration and Data Replication: We specialize in integrating and replicating data into Redshift, Azure, Vertica, and other data warehousing technologies through customized revolutionary products and processes
- ETL / Data Integration and Conversion: Ask us about our groundbreaking product for ETL-DW! Our experience and the custom products we’ve built for ETL-DI through Talend will give you a new level of speed and scalability
- Data Quality by Talend: From mapping, profiling, and establishing data quality rules, we’ll help you get the right support mechanisms set up for your enterprise
- Integrate Your Applications: Talend Enterprise Service Bus can be leveraged for your enterprise’s data integration strategy, allowing you to tie together many different data-related technologies and get them all talking and working together
- Master Data Management by Talend: We provide end-to-end capabilities and experience to master your data through architecting and deploying Talend MDM. We tailor the deployment to drive the best result for your specific industry – Retail, Financial, Healthcare, Insurance, Technology, Travel, Telecommunications, and others
- Business Process Management: Our expertise in Talend Open Studio will lead the way for your organization’s overall BPM strategy
WHAT WE DO
As a leading Systems Integrator with years of expertise integrating the latest and greatest IT technologies, we help you work smarter, not harder, and at a better Total Cost of Ownership. Our resources are based throughout the United States and around the world. We have subject matter expertise in numerous industries and in solving IT and business challenges.
We blend all types of data and transform it into meaningful insights by creating high performance Big Data Lakes, MDM, BI, Cloud, and Mobility Solutions.
OUR CLOUD DATA LAKE SOLUTION
CloudCDC is equipped with an intuitive and user-friendly interface. Within a couple of clicks, you can load, transfer, and replicate data to any platform without any hassle. No need to worry about code or scripts.
FEATURES
• Build Data Lake on AWS, Azure, and Hadoop
• Continuous Real-Time Data Sync
• Click-to-Replicate User Interface
• Automated Integration & Data Type Mapping
• Automated Schema Build
• Codeless Development Environment
OUR SOLUTION ENHANCES DATA MANAGEMENT ACROSS INDUSTRIES
CONTACT THE EXPERTS AT BIGDATA DIMENSION FOR YOUR CLOUDCDC, TALEND, DATA ANALYTICS, AND BIG DATA NEEDS.