Sqoop Troubleshooting
Performance
Apache Sqoop allows users to efficiently transfer bulk data between Apache Hadoop and relational databases. When transferring large amounts of data, it is important to keep performance considerations in mind so that execution time stays short and the source RDBMS is not impacted.
Increase Parallelism
A Sqoop job runs as a set of map tasks, so tuning a Sqoop job is similar to optimizing a MapReduce job. The first thing to consider is increasing the number of parallel tasks to make full use of the resources available in the cluster. Use the -m flag to specify the number of mappers. For exports, use the --batch option, which enables batch mode for the underlying statement execution. When using Sqoop with Oozie, specify --skip-dist-cache to skip the step where Sqoop copies its dependencies to the job cache, which avoids unnecessary I/O.
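As a rough sketch of how these options fit together, the script below assembles an import that uses eight mappers and an export that uses batch mode. The JDBC URL, table names, HDFS paths, and the split column order_id are hypothetical placeholders, not values from any real environment. The commands are only printed so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical connection details; replace with values for your environment.
JDBC_URL="jdbc:mysql://db.example.com/sales"
NUM_MAPPERS=8

# Import: -m sets the number of parallel map tasks; --split-by names the
# column Sqoop uses to divide the table among them (order_id is assumed).
IMPORT_CMD="sqoop import --connect $JDBC_URL --table orders --split-by order_id -m $NUM_MAPPERS --target-dir /data/orders"

# Export: --batch enables batch mode for the underlying statement execution.
EXPORT_CMD="sqoop export --connect $JDBC_URL --table orders_summary --export-dir /data/orders_summary -m $NUM_MAPPERS --batch"

# Print the commands for review; run them once the placeholders are filled in.
echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

Each mapper opens its own connection to the source database, so it is usually safer to raise the mapper count gradually and watch the load on the RDBMS.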
Common Errors
Import failed: Can not create a Path from an empty string
When executing an Oozie workflow, the Sqoop job fails with the error "Import failed: Can not create a Path from an empty string".
Root Cause
The sqoop action is missing the --skip-dist-cache argument.
Solution
Add the --skip-dist-cache argument to the sqoop action.
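For illustration, a sqoop action in an Oozie workflow might carry the argument as sketched below. The action name, transition targets, JDBC URL, table, and target directory are hypothetical, and ${jobTracker} and ${nameNode} are assumed to be defined in the workflow's job properties.

```xml
<!-- Sketch of an Oozie sqoop action; all names and paths are placeholders. -->
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- --skip-dist-cache stops Sqoop from copying its dependencies to the job cache. -->
        <command>import --skip-dist-cache --connect jdbc:mysql://db.example.com/sales --table orders --target-dir /data/orders -m 8</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>
```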