--------------------------------------------------------------------------------
datamash: basic data analysis
--------------------------------------------------------------------------------

datamash is a command-line tool for analyzing numeric and textual data. It is
designed to be robust, with strict input validation. awk can do more than
datamash, but it is not built to check whether the data are valid.

https://github.com/agordon/datamash/tree/master/examples   [examples]
https://www.gnu.org/software/datamash/alternatives/        [one-liners, cf awk]

$ datamash OPTION op field op field [stdin]

op refers to operations

OPTIONS:
  -C      ... skip comment lines ("^ *#" or "^ *;")
  -t "X"  ... by default IFS=OFS=\t. Use "X" instead.
              --output-delimiter=OFS
  -W      ... IFS=whitespace (runs of spaces and/or tabs)
  -f      ... print entire line (default is to print only grouped keys)
  -g      ... group by fields
  -h      ... first line of input and output is header
  -i      ... ignore case
  -s      ... sort the input before grouping
  -R N    ... round numeric output to N decimal places
  -z      ... lines end with NUL instead of \n

Primary Operations:
  groupby ... -g
  crosstab X, Y
  transpose
  reverse
  check [N lines] [N fields]

Statistical:
  mean, trimmean, median, q1, q3, iqr, perc, mode, antimode, stdev, var, mad,
  skew, kurt, jarque, dpo, cov, pearson
  (stdev, var, skew, kurt, cov and pearson take a p or s prefix for the
  population or sample variant, e.g. pstdev, sstdev)

Numeric:
  sum, min, max, absmin, absmax, range

Textual/Numeric:
  first, last, rand (select randomly), unique, countunique,
  collapse (comma-separated output list)

Line Filtering:
  rmdup ... remove duplicate lines

Per-line operations:
  base64, debase64, bin (histogram), strbin (hash),
  round/floor/ceil/trunc/frac
  dirname/basename, extname (extension name), barename
  getnum (extract number from string)
  cut (copy input field to output field)

------------------------------------------------------------------------
I. Analyze by columns
------------------------------------------------------------------------

Let us generate a set of 10 random numbers

$ jot -r 10 > a
$ cat a
62
18
31
4
56
95
86
4
78
81

The general format of the command line is (data are read from stdin)

$ datamash "op" "col_number" "op" "col_number"

If the data have a header line, as in file b:

$ cat b
value
62
18
31
4
56
95
86
4
78
81

In this case, to skip the header,

$ datamash --header-in mean 1