Friday, November 4, 2016

Full join in Talend

Introduction:

There is no commonly shared or pre-defined way to achieve a full join in Talend, so I am sharing my approach here.

What needs to be achieved with a full join?

Consider the two record sets below.

Table1

EmpId   EmpName   Salary
ID1     Emp1      15000
ID2     Emp2      16000
ID3     Emp3      17000

Table2

EmpId   Emp_Age   Emp_Address
ID1     25        Chennai
ID2     35        Bangalore
ID4     30        Chennai

Notice that Table1 has no row for ID4 and Table2 has no row for ID3.

The result set we need to produce with the full join is:

EmpId   EmpName   EmpSalary   EmpAge   EmpAddress
ID1     Emp1      15000       25       Chennai
ID2     Emp2      16000       35       Bangalore
ID3     Emp3      17000       NULL     NULL
ID4     NULL      NULL        30       Chennai
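
In SQL terms, this is simply the output of a FULL OUTER JOIN on EmpId. A minimal sketch, assuming the two tables exist as Table1 and Table2 in a database that supports FULL OUTER JOIN directly (MySQL, for instance, does not):

-- Standard-SQL view of the desired result.
SELECT COALESCE(t1.EmpId, t2.EmpId) AS EmpId,
       t1.EmpName,
       t1.Salary      AS EmpSalary,
       t2.Emp_Age     AS EmpAge,
       t2.Emp_Address AS EmpAddress
FROM Table1 t1
FULL OUTER JOIN Table2 t2
  ON t1.EmpId = t2.EmpId;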

How to achieve Full Join?

Follow the steps below to achieve the full join.

Step 0: Consider the two tables below (temporary data creation)

Table1
Table2


Step 1: Store the record sets in tHashOutput components

Fetch the data from the two tables and store each record set in a separate tHashOutput component, as shown below.

Table1


Table2

Step 2: Map the data from Table1 and Table2

In a tMap component, map the data stored in tHashOutput_1 and tHashOutput_2, with the Table1 data as the main flow and the Table2 data as the lookup flow.

Mapping in tMap,

Step 3: Store the output in tHashOutput

Store the output, which contains the data joined with Table1 as the main table and Table2 as the lookup table, in another tHashOutput component.


Step 4: Repeat steps 2 and 3 with Table2 as the main table and Table1 as the lookup table

See the image below for the mapping.

Step 5: Merge the two result sets and keep unique records

Now merge the two result sets using a tUnite component and keep only the unique records using a tUniqRow component.
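
Conceptually, steps 1 to 5 reproduce the usual SQL trick for emulating a full outer join on engines that lack it: a left join in each direction, a union, and a de-duplication. A sketch using the tables above:

-- Left join in each direction, then UNION (not UNION ALL),
-- so rows matched in both tables are kept only once.
SELECT t1.EmpId, t1.EmpName, t1.Salary, t2.Emp_Age, t2.Emp_Address
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.EmpId = t2.EmpId
UNION
SELECT t2.EmpId, t1.EmpName, t1.Salary, t2.Emp_Age, t2.Emp_Address
FROM Table2 t2
LEFT JOIN Table1 t1 ON t1.EmpId = t2.EmpId;

The first SELECT corresponds to the tMap pass with Table1 as main, the second to the pass with Table2 as main, UNION plays the role of tUnite, and the duplicate removal is what tUniqRow does.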


Final output:




Wednesday, May 18, 2016

Another simple way to remove duplicate data

Purpose:

In my previous blog, I explained one way to avoid duplicate records. Here is a simple query that achieves the same result.

SELECT DISTINCT column1, column2, column3, .......

FROM tablename;

                                                              OR

SELECT column1, column2, column3, .......
FROM (
     SELECT COUNT(1) count_value, column1, column2, column3, .......
     FROM tablename
     GROUP BY column1, column2, column3, .......
) t;


                                                              OR

SELECT column1, column2, column3, .......

FROM tablename
GROUP BY column1, column2, column3, .......;


Notice that the sub-query contains COUNT(1), which the outer query does not select.

How does this query avoid duplicates?

Follow the approach below to avoid duplicates in the records or to delete duplicate records.

Consider the table "Employee" below, which has two duplicate records:


EmpId   EmpName   Salary
1       AAAA      20000
2       BBBB      30000
2       BBBB      30000
3       CCCC      25000
4       DDDD      20000
4       DDDD      20000

SELECT COUNT(1) count_value, EmpId, EmpName, Salary
FROM Employee
GROUP BY EmpId, EmpName, Salary;

The above query returns the result below:


count_value   EmpId   EmpName   Salary
1             1       AAAA      20000
2             2       BBBB      30000
1             3       CCCC      25000
2             4       DDDD      20000

It is clear from this result that EmpId 2 and EmpId 4 have duplicate records.

So, with the above query, we can group the duplicate records together.
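
If you only want to list the groups that actually contain duplicates, a small variation of the same query adds a HAVING clause:

SELECT COUNT(1) count_value, EmpId, EmpName, Salary
FROM Employee
GROUP BY EmpId, EmpName, Salary
HAVING COUNT(1) > 1;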

Now create a temporary table with the same schema and insert data into it using one of the queries below.


INSERT INTO Employee_temp
SELECT EmpId, EmpName, Salary
FROM (
     SELECT COUNT(1) count_value, EmpId, EmpName, Salary
     FROM Employee
     GROUP BY EmpId, EmpName, Salary
) t;


                                                                     OR
INSERT INTO Employee_temp
SELECT EmpId, EmpName, Salary
FROM Employee
GROUP BY EmpId, EmpName, Salary;

                                                                     OR
INSERT INTO Employee_temp
SELECT DISTINCT EmpId, EmpName, Salary
FROM Employee;



Now the temp table contains unique data, as below:



EmpId   EmpName   Salary
1       AAAA      20000
2       BBBB      30000
3       CCCC      25000
4       DDDD      20000

Now drop the Employee table and rename the Employee_temp table to Employee.
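
In MySQL, for example, that final swap could look like the sketch below (it simply uses the table names from above; take a backup before dropping anything):

DROP TABLE Employee;
RENAME TABLE Employee_temp TO Employee;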

Note: this approach will not work for the records below if you need all of the columns.


EmpId   EmpName   Salary   Department
1       AAAA      20000    Dept1
2       BBBB      30000    Dept1
2       BBBB      30000    Dept2
3       CCCC      25000    Dept2
4       DDDD      20000    Dept2
4       DDDD      20000    Dept3

Notice that the Department value for EmpId 2 and EmpId 4 differs between their two records, so those rows are not exact duplicates.
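
To see why it fails here, run the same de-duplication over all four columns of this version of the table; because Department differs, both rows for EmpId 2 and EmpId 4 survive:

SELECT DISTINCT EmpId, EmpName, Salary, Department
FROM Employee;
-- EmpId 2 and EmpId 4 still appear twice, once per Department value.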

If you do not need the Department column, then the query works as before.

I hope this helps you avoid or delete duplicate records.

Efficient upload of multiple files to an Amazon S3 bucket and parallel copying of the file data into Amazon Redshift

Introduction:


          In this blog, let us discuss an interesting topic: how to upload multiple files to an Amazon S3 bucket efficiently and quickly using Talend Open Studio, and then how to copy the data from those files into multiple Amazon Redshift tables just as efficiently.

Purpose:

I have already discussed transferring data from a MySQL database into an Amazon Redshift database via an Amazon S3 bucket in my previous post. Check that out first.

Imagine a scenario where I need to transfer data from 100 tables in MySQL to 100 tables in Redshift, and each table has, say, 1 million records.

To do this, the process below has to be repeated 100 times.



Obviously this process has to be undertaken 100 times. The question is: should we run it as a parallel process or a sequential process?

Of course, running it sequentially takes much longer, so parallel processing is the best way to get it done faster.

How to do it?

Below is an approach to achieve this using Talend Open Studio.

Step 1: Job to generate files

The first step is to create a job (or jobs) to generate the files that need to be uploaded to the Amazon S3 bucket. (JobName: GenerateFile)

You can use a single job or multiple jobs, depending on your requirement. To create a single job with a dynamic schema, refer to this post.

The job reads data from a MySQL table and generates a CSV file.


The file name and file path can be set dynamically from context variables, as shown in the above image.

Step 2: Upload files to S3 bucket

The next step is to create a single job to upload the generated files to the S3 bucket. (JobName: UploadFile)


This job will upload the generated file to the provided S3 bucket.


The source file path and the target file path can be set dynamically based on the context variables.

Step 3: Copy the file data to Amazon Redshift tables

The third step is to create a single job to copy the file data into the Redshift tables. (JobName: CopyFile)


This job copies the data from the file into the Redshift table.


The COPY query can be generated dynamically, as shown in the above image. The table name and file name can be set dynamically with the context variables.
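
As a rough sketch of what the generated statement ends up looking like (the bucket, file path, and credential placeholders below are illustrative, not values from the actual job):

-- Load one CSV file from S3 into one Redshift table.
COPY my_table
FROM 's3://my-bucket/exports/my_table.csv'
CREDENTIALS 'aws_access_key_id=<ACCESS_KEY>;aws_secret_access_key=<SECRET_KEY>'
CSV
IGNOREHEADER 1;

In the Talend job, the table name and the file name in the S3 path would be filled in from context variables such as TableName.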

Step 4: Call the generate-file job 20 times simultaneously

Now we have the three jobs, each of which has to be run more than once simultaneously. So how do we call a job more than once at the same time? Below is an example job. (JobName: GenerateFileIntermediate)

The job contains 20 tRunJob components, each of which calls the same child job with different context variables.

Notice that each component has a parameter named TableName, which is a context variable, and that parameter's value differs on each component.

Step 5: Call upload file job 20 times

Create a job that contains 20 tRunJob components, each of which calls the upload job. (JobName: UploadFileIntermediate)

Each component carries a different value for the TableName parameter, as shown in the above image.

Step 6: Call the copy job 20 times

Create a job that calls the copy job 20 times. (JobName: CopyIntermediate)

Notice that I let only 10 copy jobs run simultaneously and run the other 10 sequentially afterwards. This is because the degree of parallelism should be based on the number of cores (slices) available in your Redshift cluster: if your cluster has only 6, split the jobs into groups of 6 rather than 10.


Each tRunJob component has a different parameter value, as in the above image.

Step 7: Final job

Now create the final job, which is the master job that invokes the other jobs.


Notice that the GenerateFileIntermediate job is called first, which in turn calls GenerateFile 20 times. Once the 20 files have been generated, the UploadFileIntermediate job is invoked, which calls UploadFile 20 times. Then the CopyIntermediate job is called, which calls CopyFile 20 times.

Conclusion:

The concept here is that a separate job is created for generating the files, uploading them to S3, and copying the data into the Redshift tables. Each of those jobs is then called multiple times with different parameters, and finally a master job calls the parallel jobs in sequence.

With this method, we make efficient use of parallel processing when loading Amazon S3 and Amazon Redshift.



Monday, January 25, 2016

Datetime sent by Talend component and datetime received in MySQL stored procedure are not the same

Purpose:

I created a job that picks a datetime from Redshift and sends it to a MySQL stored procedure (SP). The expected result was not achieved, and I figured out that the datetime sent by the tMysqlSP component is different from the datetime received by the MySQL stored procedure. I solved the issue as described below.

The problem:

Talend sends a datetime, say 2016-01-15 04:28:09, to the SP, but the SP receives 2016-01-15 00:00:00 (the time portion is lost).

Solution:

I made the Talend component send a String rather than a DateTime, i.e., I sent "2016-01-15 04:28:09" as text rather than as a DateTime value.
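
On the MySQL side, the stored procedure can then accept the value as text and convert it back itself. A minimal sketch, with a hypothetical procedure and parameter name (not the actual procedure from my job):

-- Hypothetical procedure: receive the datetime as a string
-- and convert it back with STR_TO_DATE inside the procedure.
DELIMITER //
CREATE PROCEDURE process_event(IN p_event_ts VARCHAR(19))
BEGIN
    DECLARE v_event_ts DATETIME;
    SET v_event_ts = STR_TO_DATE(p_event_ts, '%Y-%m-%d %H:%i:%s');
    -- ... use v_event_ts in place of the original DATETIME parameter ...
    SELECT v_event_ts;
END //
DELIMITER ;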