Sunday, April 27, 2014

Unix xargs parallel execution of commands


Xargs has option that allows you to take advantage of multiple cores in your machine. Its -P option which allows xargs to invoke the specified command multiple times in parallel. From XARGS(1) man page:
-P max-procs
   Run up to max-procs processes at a time; the default is 1.  If max-procs is 0, xargs will run as many processes as possible at a time.   Use  the  -n  option
   with -P; otherwise chances are that only one exec will be done.

-n max-args
    Use at most max-args arguments per command line.  Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the  -x  
    option is given, in which case xargs will exit.

-i[replace-str]
    This option is a synonym for -Ireplace-str if replace-str is specified, and for -I{} otherwise.  This option is deprecated; use -I instead.
Let me try to give one example where we can make use of this parallel option avaiable on xargs. e.g. I got these 8 log files (each one is of 1.5G size) for which I have to run a script named count_pipeline.sh which does some calculation around the log lines in the log file.
$ ls -1 *.out
log1.out
log2.out
log3.out
log4.out
log5.out
log6.out
log7.out
log8.out
The script count_pipeline.sh takes nearly 20 seconds for a single log file. e.g.
$ time ./count_pipeline.sh log1.out

real 0m20.509s
user 0m20.967s
sys 0m0.467s
If we have to run count_pipeline.sh for each of the 8 log files one after the other, total time needed:
$ time ls *.out | xargs -i ./count_pipeline.sh {}           

real 2m45.862s
user 2m48.152s
sys 0m5.358s
Running with 4 parallel processes at a time (I am having a machine which is having 4 CPU cores):
$ time ls *.out | xargs -i -P4 ./count_pipeline.sh {} 

real 0m44.764s
user 2m55.020s
sys 0m6.224s
We saved time ! Isn't this useful ? You can also use -n1 option instead of the -i option that I am using above. -n1 passes one arg a time to the run comamnd (instead of the xargs default of passing all args).
$ time ls *.out | xargs -n1 -P4 ./count_pipeline.sh

real 0m43.229s
user 2m56.718s
sys 0m6.353s

Related posts:
- UNIX handle kill when no PID available 

2 comments:

Ole Tange said...

If you like xargs -P you might want to check out GNU Parallel, which has much better control of how the jobs are run: http://pi.dk/1/ http://www.gnu.org/software/parallel/parallel_tutorial.html

Unknown said...

@Ole Tange, thanks a lot for sharing this, this is useful.

© Jadu Saikia www.UNIXCL.com