Sqoop Troubleshooting

Performance

Apache Sqoop transfers bulk data efficiently between Apache Hadoop and relational databases. When transferring large amounts of data, keep performance considerations in mind so that execution time stays short and the load on the source RDBMS stays low.

Increase Parallelism

A Sqoop job runs as a set of map tasks, so tuning a Sqoop job is similar to optimizing a MapReduce job. The first thing to consider is increasing the number of parallel tasks to make full use of the resources available in the cluster. Use the -m flag to specify the number of mappers. For exports, use the --batch option, which executes the underlying statements in batch mode. When using Sqoop with Oozie, specify --skip-dist-cache to skip the step where Sqoop copies its dependencies to the job cache, which saves I/O.
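
As a rough illustration of the flags above, the following sketch builds one import and one export command and prints them instead of running them, so it can be tried without a cluster. The JDBC URL, user, password file, table names, split column, directories, and the mapper count of 8 are all hypothetical placeholders, not values from any real environment.

```shell
#!/bin/sh
# Dry-run sketch: builds the commands as strings and prints them.
# All hosts, paths, and table names below are hypothetical.

NUM_MAPPERS=8   # assumed value; match it to cluster capacity and database limits

# Import: --num-mappers is the long form of -m; --split-by names the column
# whose value range Sqoop divides among the mappers.
import_cmd="sqoop import \
--connect jdbc:mysql://db.example.com/sales \
--username etl_user --password-file /user/etl/.db-password \
--table orders --split-by order_id \
--num-mappers $NUM_MAPPERS \
--target-dir /data/raw/orders"

# Export: same mapper setting, plus --batch.
export_cmd="sqoop export \
--connect jdbc:mysql://db.example.com/sales \
--username etl_user --password-file /user/etl/.db-password \
--table orders_summary --export-dir /data/out/orders_summary \
--num-mappers $NUM_MAPPERS --batch"

echo "$import_cmd"
echo "$export_cmd"
```

Running the printed lines on a host with Sqoop installed would start the transfers. A higher mapper count also means more concurrent connections to the source database, so it is usually worth raising it gradually.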

Common Errors

Import failed: Can not create a Path from an empty string

When executing an Oozie workflow, the Sqoop job fails with:

ERROR org.apache.sqoop.tool.ImportTool - Imported Failed: Can not create a Path from an empty string

Root Cause

The Sqoop action is missing the --skip-dist-cache argument.

Solution

Add the --skip-dist-cache argument to the Sqoop action.
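
A minimal sketch of what the Sqoop action might look like with the argument in place, assuming the workflow uses the sqoop-action 0.2 schema and passes the arguments through a single command element. The action name, transition targets, connection string, credentials file, table, and directory are hypothetical; adapt them to the existing workflow.

```xml
<!-- Hypothetical Sqoop action; only the trailing skip-dist-cache argument is the fix. -->
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- The leading "sqoop" is omitted inside an Oozie command element. -->
        <command>import --connect jdbc:mysql://db.example.com/sales --username etl_user --password-file /user/etl/.db-password --table orders --target-dir /data/raw/orders -m 8 --skip-dist-cache</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

If the workflow lists arguments as separate arg elements instead of one command element, the flag would go in its own arg element.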
