Talend tUnconnected Lookup Properties & Usage Scenarios

tUnconnectedHashOutput
This component pushes data to a cache that can be either in-memory or on the file system. This cache can later be used with the tUnconnectedHashMap component to do lookups/joins. Hashes can be shared between jobs. Key(s) can be built from one or multiple columns.
This component relies on MapDB for storing data: http://www.mapdb.org/. An older version of MapDB is used for backward compatibility.
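Conceptually, the component writes each row into a MapDB map keyed by the configured key columns. Below is a minimal sketch of that behavior, assuming the MapDB 1.x API; the map name, key separator, and value layout are illustrative assumptions, not the component's actual generated code.

    import java.io.File;
    import java.util.Map;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;

    public class HashOutputSketch {
        public static void main(String[] args) {
            // File-system backed cache; DBMaker.newMemoryDB() would give the
            // in-memory variant (see the Storage Method setting below).
            DB db = DBMaker.newFileDB(new File("/tmp/lookup_cache.db")).make();

            // Named map inside the DB; the name "customers" is illustrative.
            Map<String, String[]> cache = db.getHashMap("customers");

            // Composite key built from two key columns; the "|" separator is
            // an assumption, not necessarily what the component generates.
            String key = "John" + "|" + "Doe";
            cache.put(key, new String[] { "2017-01-15", "5000" }); // value columns

            db.commit();
            db.close();
        }
    }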
tUnconnectedHashOutput Standard properties
These properties are used to configure tUnconnectedHashOutput running in the Standard Job framework.
Basic settings
Schema and Edit schema
A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either built-in or remotely stored in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
Built-in: The schema is created and stored locally for this component only. Related topic: see the Talend Studio User Guide.
Repository: The schema already exists and is stored in the Repository, hence can be reused. Related topic: see the Talend Studio User Guide.
This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see the Talend Studio User Guide. The dynamic schema feature is designed for retrieving unknown columns of a table and is recommended for that purpose only; it is not recommended for creating tables.
Append existing
Select this check box to connect to a tUnconnectedHashOutput component.
Component list
Drop-down list of available tUnconnectedHashOutput components.
Storage Method
Changes how the cache is stored: in memory or on the file system.
Manually Provide Key Column name
List the column names that will be used to build the key for the cache. These names are used during code generation, hence globalMap references and/or context variables must not be used, because those values are unknown at code-generation time.
Value column name
Provide the column(s) that should be stored in the cache. Only these columns can be retrieved later. One or many columns can be stored.
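Because the key columns are resolved at code-generation time, the generated key-building code references the columns literally. A hedged sketch of what such generated code might look like (the column names and the "|" separator are illustrative assumptions):

    // Hypothetical shape of the key-building code generated for two key
    // columns, FirstName and LastName. Only literal column references appear
    // here: globalMap/context values are unknown when this code is generated.
    public static String buildKey(String firstName, String lastName) {
        return firstName + "|" + lastName;
    }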
Advanced settings
Use existing
This is used if an existing tUnconnectedHashOutput cache should be reused. It is similar to Append existing under the Basic settings. Use this if the cache was created outside of the job, e.g. in a parent job. Please note that the global variable that ends with _DB is the one needed here.
The rest of the settings require a Subscription (Enterprise version).
Column name that holds the key
List the column names that will be used to build the key for the cache. As opposed to the Basic settings, this is NOT used during code generation, hence globalMap references and/or context variables can be used.
Value column name
Provide the column(s) that should be stored in the cache. Only these columns can be retrieved later. One or many columns can be stored. Can be a globalMap or context variable.
tStatCatcher Statistics
Select this check box to collect log data at the component level.
Global Variables
Global Variables
NB_KEYS: the number of unique keys in the map. This is an After variable and it returns an integer.
NB_LINE: the number of rows processed. This is an After variable and it returns an integer.
DB: the DB object that can be used later to connect to the cache and perform operations.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide.
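The DB object can be picked up from globalMap in a downstream component such as tJava. A minimal sketch, assuming the variable key follows the documented _DB suffix for a component named tUnconnectedHashOutput_1 (both the exact variable key and the map name "customers" are assumptions):

    // Code for a tJava component; globalMap is provided by the Talend job.
    org.mapdb.DB db = (org.mapdb.DB) globalMap.get("tUnconnectedHashOutput_1_DB");

    // Re-open the named map and inspect it; the map name is an assumption.
    java.util.Map<String, String[]> cache = db.getHashMap("customers");
    System.out.println("cached keys: " + cache.size());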
Usage
Usage rule
This component populates data for the tUnconnectedHashMap component. Together, these twin components offer high-speed data access to facilitate lookups/joins involving a massive amount of data.
tUnconnectedHashMap
This component allows us to do lookups against cache(s) populated by the tUnconnectedHashOutput component. Additionally, custom Java code can be injected in a similar manner to how tJavaRow works. Hashes can be shared between jobs. Key(s) can be built from one or multiple columns. The join can be inner or left.
This component relies on MapDB for storing data: http://www.mapdb.org/. An older version of MapDB is used for backward compatibility.
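The inner/left distinction determines what happens to a main-flow row that finds no match in the cache. A minimal, self-contained sketch of the semantics (a plain HashMap stands in for the MapDB-backed cache; names and values are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapLookupSketch {
        public static void main(String[] args) {
            Map<String, String[]> cache = new HashMap<>(); // stands in for the MapDB map
            cache.put("John|Doe", new String[] { "2017-01-15", "5000" });

            boolean innerJoin = true; // false would give left-join behavior
            String key = "Jane" + "|" + "Doe";
            String[] looked = cache.get(key);

            if (looked == null && innerJoin) {
                System.out.println("no match: row rejected (inner join)");
            } else {
                // Left join: the row passes, lookup columns stay null on no match.
                String joinDate = (looked != null) ? looked[0] : null;
                System.out.println("row passes, joinDate=" + joinDate);
            }
        }
    }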
tUnconnectedHashMap Standard properties
These properties are used to configure tUnconnectedHashMap running in the Standard Job framework.
Basic settings
Schema and Edit schema
A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either built-in or remotely stored in the Repository. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
Built-in: The schema is created and stored locally for this component only. Related topic: see the Talend Studio User Guide.
Repository: The schema already exists and is stored in the Repository, hence can be reused. Related topic: see the Talend Studio User Guide.
This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see the Talend Studio User Guide. The dynamic schema feature is designed for retrieving unknown columns of a table and is recommended for that purpose only; it is not recommended for creating tables.
Basic Mapping
This table allows one to define the 1:1 mapping between input and output schemas. The generated Java code will be: Output column = Input column
Lookup mapping
This table has the following columns:
Column calculations
This table has the following columns:
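Taken together, the three tables describe simple per-row Java. A hedged sketch of the per-row logic they imply, using hypothetical column names borrowed from the scenario below (the real generated code may differ):

    import java.util.Map;

    // Illustrative row structure; Talend generates similar per-flow row classes.
    class Row {
        String FirstName, LastName, TransactionDate, ExpectedSalary, TxMonth;
    }

    class MappingSketch {
        static void processRow(Row in, Row out, Map<String, String[]> cache) {
            out.FirstName = in.FirstName;               // Basic Mapping: output = input
            out.LastName = in.LastName;
            String[] lk = cache.get(in.FirstName + "|" + in.LastName); // Lookup mapping
            out.ExpectedSalary = (lk != null) ? lk[1] : null;          // retrieved value column
            out.TxMonth = in.TransactionDate.substring(0, 7);          // Column calculation, e.g. "2018-03"
        }
    }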
Advanced settings
These settings require a Subscription (Enterprise version). Please be careful with these features because they partially go against the principle of using dynamic schemas; use them wisely.
Columns to be removed
List the column names that shall be removed from the dynamic schema.
Columns to Lookup
This table has the following columns:
Column calculations
This table has the following columns:
tStatCatcher Statistics
Select this check box to collect log data at the component level.
Global Variables
Global Variables
NB_LINES_OUT: the number of rows processed. This is an After variable and it returns an integer.
NB_LINES_IN: the number of input rows. This may be higher than NB_LINES_OUT if an inner join is used. This is an After variable and it returns an integer.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide.
Usage
Usage rule
This component uses the data populated by the tUnconnectedHashOutput component. Together, these twin components offer high-speed data access to facilitate lookups/joins involving a massive amount of data.
Scenario
In this example we’ll create a lookup based on 2 keys, and we’ll retrieve 2 columns from that lookup. This means our lookup schema consists of 4 columns in total. (It’s not in 3NF, but for demo purposes it’s good enough.) These 4 columns will be:
First Name, Last Name, Join Date, Expected Salary
Our main flow will contain the following columns:
First Name, Last Name, Transaction Date, Transaction Amount
We’ll try to find all the customers who joined more than 3 months ago and whose monthly income is less than their expected salary.
If we don’t add the customers who joined in the last 3 months to the lookup and use an inner join, we solve the first part of the problem: customers who joined more than 3 months ago. For the second part we have to group by month to get the monthly income. In practice this looks like the following:
We do an inner join against the tUnconnectedHashOutput cache, plus a calculation to get only the month back.
Once this is done, we can aggregate by that calculated column.
Then filter down (a plain-Java sketch of the whole pipeline follows below):
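Putting the steps together, here is a hedged, plain-Java equivalent of the pipeline. The actual job uses Talend components; all names, dates, and amounts below are illustrative:

    import java.time.LocalDate;
    import java.util.HashMap;
    import java.util.Map;

    public class ScenarioSketch {
        public static void main(String[] args) {
            // Lookup cache: (FirstName|LastName) -> { JoinDate, ExpectedSalary }.
            // Customers who joined within the last 3 months are simply not added,
            // so the later inner join filters their transactions out.
            Map<String, String[]> lookup = new HashMap<>();
            LocalDate cutoff = LocalDate.now().minusMonths(3);
            LocalDate joined = LocalDate.parse("2017-01-15");
            if (joined.isBefore(cutoff)) {
                lookup.put("John|Doe", new String[] { joined.toString(), "5000" });
            }

            // Main flow rows: FirstName, LastName, TransactionDate, TransactionAmount.
            String[][] rows = {
                { "John", "Doe", "2018-03-01", "1200" },
                { "John", "Doe", "2018-03-15", "1800" },
            };

            // Inner join + month calculation, then aggregation per (key, month).
            Map<String, Double> monthlyIncome = new HashMap<>();
            Map<String, Double> expectedSalary = new HashMap<>();
            for (String[] r : rows) {
                String key = r[0] + "|" + r[1];
                String[] lk = lookup.get(key);
                if (lk == null) continue;            // inner join: drop unmatched rows
                String month = r[2].substring(0, 7); // calculated column, e.g. "2018-03"
                monthlyIncome.merge(key + "|" + month, Double.parseDouble(r[3]), Double::sum);
                expectedSalary.put(key + "|" + month, Double.parseDouble(lk[1]));
            }

            // Final filter: keep months where income is below the expected salary.
            monthlyIncome.forEach((k, income) -> {
                if (income < expectedSalary.get(k)) {
                    System.out.println(k + " earned " + income
                            + " < expected " + expectedSalary.get(k));
                }
            });
        }
    }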
The overall workflow:
And the output:
ABOUT BIG DATA DIMENSION
BigData Dimension is a leading provider of cloud and on-premise solutions for BigData Lake Analytics, Cloud Data Lake Analytics, Talend Custom Solution, Data Replication, Data Quality, Master Data Management (MDM), Business Analytics, and custom mobile, application, and web solutions. BigData Dimension equips organizations with cutting edge technology and analytics capabilities, all integrated by our market-leading professionals. Through our Data Analytics expertise, we enable our customers to see the right information to make the decisions they need to make on a daily basis. We excel in out-of-the-box thinking to answer your toughest business challenges.
TALEND CUSTOM SOLUTION MADE FOR YOU
You’ve already invested in a Talend project, or maybe you already have a Talend solution implemented but may not be utilizing the full power of the solution. To get the full value of the product, you need to get the solution implemented by industry experts.
At BigData Dimension, we have experience spanning over a decade integrating technologies around Data Analytics. As far as Talend goes, we’re one of the few best-of-breed Talend-focused systems integrators in the entire world. So when it comes to your Talend deployment and getting the most out of it, we’re here for you with unmatched expertise.
Our work covers many different industries including Healthcare, Travel, Education, Telecommunications, Retail, Finance, and Human Resources.
We offer flexible delivery models to meet your needs and budget, including onshore and offshore resources. We can deploy and scale our talented experts within two weeks.
GETTING STARTED
- Full requirements analysis of your infrastructure
- Implementation, deployment, training, and ongoing services, both cloud-based and on-premise
MEETING YOUR VARIOUS NEEDS
- BigData Management by Talend: Leverage Talend Big Data and its built-in extensions for NoSQL, Hadoop, and MapReduce. This can be done either on-premise or in the cloud to meet your requirements around Data Quality, Data Integration, and Data Mastery.
- Cloud Integration and Data Replication: We specialize in integrating and replicating data into Redshift, Azure, Vertica, and other data warehousing technologies through customized revolutionary products and processes.
- ETL / Data Integration and Conversion: Ask us about our groundbreaking product for ETL-DW! Our experience and the custom products we’ve built for ETL-DI through Talend will give you a new level of speed and scalability.
- Data Quality by Talend: From mapping, profiling, and establishing data quality rules, we’ll help you get the right support mechanisms set up for your enterprise.
- Integrate Your Applications: Talend Enterprise Service Bus can be leveraged for your enterprise’s data integration strategy, allowing you to tie together many different data-related technologies and get them all to talk and work together.
- Master Data Management by Talend: We provide end-to-end capabilities and experience to master your data through architecting and deploying Talend MDM. We tailor the deployment to drive the best result for your specific industry: Retail, Financial, Healthcare, Insurance, Technology, Travel, Telecommunications, and others.
- Business Process Management: Our expertise in Talend Open Studio will lead the way for your organization’s overall BPM strategy.
WHAT WE DO
As a leading Systems Integrator with years of expertise in the latest and greatest integrating numerous IT technologies, we help you work smarter, not harder, and at a better Total Cost of Ownership. Our resources are based throughout the United States and around the world. We have subject matter expertise in numerous industries and solving IT and business challenges.
We blend all types of data and transform it into meaningful insights by creating high performance Big Data Lakes, MDM, BI, Cloud, and Mobility Solutions.
OUR CLOUD DATA LAKE SOLUTION
CloudCDC is equipped with the most intuitive and user-friendly interface. Within a couple of clicks, you can load, transfer, and replicate data to any platform without any hassle. No need to worry about code or scripts.
FEATURES
• Build Data Lake on AWS, Azure, and Hadoop.
• Continuous Real Time Data Sync.
• Click-to-replicate user interface.
• Automated Integration & Data Type Mapping.
• Automated Schema Build.
• Codeless Development Environment.
OUR SOLUTION ENHANCES DATA MANAGEMENT ACROSS INDUSTRIES
CONTACT THE EXPERTS AT BIGDATA DIMENSION FOR YOUR CLOUDCDC, TALEND, DATA ANALYTICS, AND BIG DATA NEEDS.