Variations on a theme

Sometimes you start to follow a thought and end up in unexpected places. Not only do you find out how deep the rabbit hole goes, you even start digging extra tunnels.

A similar thing happened to me when I was faced with the following problem: you have a file containing values separated by tab characters. The first row holds the column names. Something like this:

id  name  status  date  type
1   n1    s1      d1    t1
2   n2    s2      d2    t2
3   n3    s3      d3    t3

You need to get the values from the first column, separated by spaces. For this example, the result would be: 1 2 3. To ramp up the difficulty, we want to solve this on the command line and don't want to write a script. There was already a solution for it, too:

cat file.tsv | awk '(NR>1) { print $1 }'  |  tr '\n'  ' '

Although I'm also a big awk fan myself, this wasn't the first solution that came to my mind. But there is nothing wrong with that. In fact, this started me on the journey to find more alternative solutions.
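If you want to follow along, here is a minimal sketch for recreating the sample and running the solution above. The temporary file from mktemp stands in for file.tsv, and the square brackets are only there to make the trailing space visible.

```shell
# Create the sample in a temporary file; the tabs are written explicitly as \t.
file=$(mktemp)
printf 'id\tname\tstatus\tdate\ttype\n1\tn1\ts1\td1\tt1\n2\tn2\ts2\td2\tt2\n3\tn3\ts3\td3\tt3\n' > "$file"

# The solution quoted above, with the output captured in a variable.
result=$(cat "$file" | awk '(NR>1) { print $1 }' | tr '\n' ' ')
printf '[%s]\n' "$result"   # prints [1 2 3 ]

rm -f "$file"
```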

Dissecting the problem

First, we should examine the task a little bit more. It can be divided into three parts:

  1. get rid of the first row
  2. get the first column from every row
  3. transform the rows into a single row (where the values are separated with spaces)

Now that we have a couple of smaller tasks, we can look for solutions for each one of them separately.

Getting rid of the first row

awk '(NR>1) { print }'

A somewhat forced example based on the original solution. If we have come this far with awk, we could go even further, but we will get back to that a little later.

sed 1d

I found this one while digging, and it looks pretty elegant. It deletes the first line and passes everything else through unchanged.

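tail -n +2

The better-known tail -1 (or tail -n 1) returns only the last row. With a plus sign the count starts from the top instead, so this one returns everything from the second row to the end. The older tail +2 spelling means the same thing, but some implementations (GNU tail among them) may read +2 as a file name.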
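A quick way to convince yourself that the three commands are interchangeable is to run them on the same input and compare. This is only a sketch: the sample function is a helper I made up for the demo, and tail is spelled with the -n flag, which is the form POSIX describes.

```shell
# Hypothetical helper that prints a shortened version of the sample file.
sample() { printf 'id\tname\n1\tn1\n2\tn2\n3\tn3\n'; }

with_awk=$(sample | awk '(NR>1) { print }')
with_sed=$(sample | sed 1d)
with_tail=$(sample | tail -n +2)

# All three should leave exactly the data rows.
[ "$with_awk" = "$with_sed" ] && [ "$with_sed" = "$with_tail" ] && echo "all three agree"
printf '%s\n' "$with_tail"
```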

Keeping only the first column

awk '{ print $1 }'

A classic awk solution. Not much to say about it.

cut -f1

My personal favorite. It's a bit dumber than awk (for example, handling multiple separator characters next to each other wouldn't work that well), but it's still useful in many cases.
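To show what I mean by adjacent separators, here is a small sketch with a made-up line that has two spaces between its values. cut treats every single separator as a field boundary, while awk's default splitting collapses runs of whitespace. Which behavior is the right one depends on the data, of course.

```shell
line='a  b'   # two spaces between the values

# cut sees an empty second field between the two spaces.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk's default field splitting treats the run of spaces as one separator.
with_awk=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf 'cut: [%s]  awk: [%s]\n' "$with_cut" "$with_awk"   # cut: []  awk: [b]
```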

sed 's/^\([^\t]\+\).*/\1/'

You can solve anything with sed. But why would you? It helps a lot that we need the first column, so the pattern can be anchored to the start of the row. It would help even more if we knew that the id was numeric only. (Note that \t and \+ are GNU sed extensions.)
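As an illustration of the numeric-only case, this is roughly what I have in mind. It is just a sketch and it assumes every data row really starts with digits. A nice side effect is that it needs neither \t nor \+.

```shell
# Keep the leading run of digits and throw away the rest of the row.
# Assumption: the id column contains digits only.
ids=$(printf '1\tn1\ts1\n22\tn2\ts2\n' | sed 's/^\([0-9]*\).*/\1/')
printf '%s\n' "$ids"
```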

Making one single row

paste -s -d' ' -

The right tool for this job: -s joins all input rows into a single row, and -d' ' sets the separator to a space. It's an excellent choice.

sed -n 'H;${g;y/\n/ /;p}'

Not as friendly as the previous one, and it generates an extra space character at the start of the line. We will go into much more detail later about what this line really means.

tr '\n'  ' '

This works nicely. The only drawback is that it replaces the last newline, so we end up with an extra space at the end of the line.

xargs

It's an odd choice. The xargs command turns its input rows into arguments and passes them to another command. The default command happens to be echo, so it does exactly what we want. However, it does not work if we need to separate the values with anything other than spaces.
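The stray spaces are easier to see side by side. In this sketch the values function is a made-up stand-in for the output of the previous steps, and the brackets mark where each result starts and ends. The sed line assumes GNU sed.

```shell
# Hypothetical helper: the first column, one value per row.
values() { printf '1\n2\n3\n'; }

# $(...) strips trailing newlines, so only stray spaces remain visible.
with_paste=$(values | paste -s -d' ' -)
with_sed=$(values | sed -n 'H;${g;y/\n/ /;p}')
with_tr=$(values | tr '\n' ' ')
with_xargs=$(values | xargs)

printf 'paste: [%s]\n' "$with_paste"   # paste: [1 2 3]
printf 'sed:   [%s]\n' "$with_sed"     # sed:   [ 1 2 3]
printf 'tr:    [%s]\n' "$with_tr"      # tr:    [1 2 3 ]
printf 'xargs: [%s]\n' "$with_xargs"   # xargs: [1 2 3]
```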

Complex solutions

We already have 36 different solutions (3 × 3 × 4 combinations), but you wouldn't write something like awk '(NR>1) { print }' | awk '{ print $1 }'. Sometimes a tool can solve multiple subtasks at once:

awk '(NR>1) { print $1 }'
sed -n '1!s/^\([^\t]\+\).*/\1/p'
perl -F'\t' -ane 'print "$F[0] " if $i > 0; $i++'
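A short sketch of how such a two-in-one command fits into the pipeline, with paste doing the final join. The sed variant here is written with [^\t] and without a \n in the pattern, because the pattern space never contains the row's own trailing newline. It assumes GNU sed, and sample is again a made-up helper.

```shell
# Hypothetical helper that prints a shortened version of the sample file.
sample() { printf 'id\tname\n1\tn1\n2\tn2\n3\tn3\n'; }

# Steps 1 and 2 in a single command, step 3 left to paste.
with_awk=$(sample | awk '(NR>1) { print $1 }' | paste -s -d' ' -)
with_sed=$(sample | sed -n '1!s/^\([^\t]\+\).*/\1/p' | paste -s -d' ' -)

printf '%s\n%s\n' "$with_awk" "$with_sed"
```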

Can we find a tool that can solve the whole problem? There should be a simple sed command that does what we want, right? The digging just got real. I dove deep into the sed documentation for answers. I was horrified by the things this tool can do. It is like the love child of awk and vi. But my efforts weren't in vain. At long last, I emerged from the depths with a command:

sed -n '1!{s/^\([^\t]\+\).*/\1/;H};${g;y/\n/ /;s/^ //;p}'

It's trivial, right? It should have been the first thing I thought of. I don't want to drag anyone down to the abyss we know as sed, but this command could use some explanation.

sed basics

sed uses commands. Every command has a <filter><command><parameter> format (the sed documentation calls the filter an address). A command runs on every row of the input where the filter matches (it feels a lot like awk in this regard). We can give more than one command if we separate them with semicolons. Several commands can also share one filter if we put them between { and }.
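A few tiny examples may make the format more tangible. They are only a sketch on made-up input (the lines helper is mine), and the last one relies on GNU sed accepting the closing brace right after p.

```shell
# Hypothetical helper with three rows of input.
lines() { printf 'one\ntwo\nthree\n'; }

# Filter "2", command "p": print only the second row.
only_second=$(lines | sed -n '2p')

# Filter "1!": run the s command on every row except the first.
not_first=$(lines | sed '1!s/^/> /')

# Filter "$" with a block: two commands that run on the last row only.
last_block=$(lines | sed -n '${s/e/E/g;p}')

printf '%s\n---\n%s\n---\n%s\n' "$only_second" "$not_first" "$last_block"
```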

One more thing worth mentioning is the "pattern space". Initially, it contains the current row, and the commands write their output back into it. There is also something called the "hold space": we can save the content of the pattern space into it and load it back later.
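The smallest hold space example I could come up with: remember the first row, then attach it to the last one. Take it as a sketch on made-up input. The h command copies the pattern space into the hold space, and G appends the hold space to the pattern space after a newline.

```shell
# 1h     : on the first row, copy the pattern space into the hold space
# ${G;p} : on the last row, append the hold space and print the result
result=$(printf 'one\ntwo\nthree\n' | sed -n '1h;${G;p}')
printf '%s\n' "$result"   # prints "three", then "one"
```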

The details of the solution

Nothing left to do but dissect the original command:

sed -n '1!{s/^\([^\t]\+\).*/\1/;H};${g;y/\n/ /;s/^ //;p}'

The -n flag tells sed not to produce any output by default. The part after that can be split into two filter blocks, a 1!{...} and a ${...}. The first one runs on every row except the first row, and the second one only runs on the last row.

The first filter block runs two commands, an s and an H. The s replaces the whole row with the value of the first column, and the H appends a newline followed by the current pattern space to the hold space.

The second filter block runs four commands: a g, a y, an s, and finally a p. The g copies the content of the hold space (the values from the first column, each preceded by a newline) into the pattern space, the y replaces the newlines with spaces, the s removes the extra space from the start of the value, and the p prints the result. We are finally done.
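If you would rather see the intermediate states than take my word for it, sed's l command prints the pattern space in an unambiguous form (newlines show up as \n and the end is marked with $). The sketch below stops the last block early at different points. It assumes GNU sed, and sample is a made-up helper.

```shell
# Hypothetical helper that prints a shortened version of the sample file.
sample() { printf 'id\tname\n1\tn1\n2\tn2\n3\tn3\n'; }

# Stop right after g: the collected values, each preceded by a newline.
after_g=$(sample | sed -n '1!{s/^\([^\t]\+\).*/\1/;H};${g;l}')

# Stop after y: the newlines are spaces now, including the leading one.
after_y=$(sample | sed -n '1!{s/^\([^\t]\+\).*/\1/;H};${g;y/\n/ /;l}')

# The full command.
final=$(sample | sed -n '1!{s/^\([^\t]\+\).*/\1/;H};${g;y/\n/ /;s/^ //;p}')

printf '%s\n' "$after_g"   # \n1\n2\n3$
printf '%s\n' "$after_y"   #  1 2 3$
printf '%s\n' "$final"     # 1 2 3
```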

Interestingly enough, if we try to use another tool (like awk or a Perl one-liner), we end up with more or less the same structure. Probably because these tools work row by row, so we need somewhere to hold the partial result.
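For comparison, this is one way the whole problem could look in awk. It is only a sketch: the variable names out and sep are my own, and out plays the same role the hold space plays in the sed version.

```shell
# Hypothetical helper that prints a shortened version of the sample file.
sample() { printf 'id\tname\n1\tn1\n2\tn2\n3\tn3\n'; }

# Collect the first column in out; sep is empty for the first value and a
# space afterwards, so no stray space is left at either end.
result=$(sample | awk 'NR > 1 { out = out sep $1; sep = " " } END { print out }')
printf '[%s]\n' "$result"   # prints [1 2 3]
```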

This marks the end of our little journey into sed-land. Lessons learned? Using the proper tools to solve subtasks is more effective than solving the whole problem with one generic tool. Happy text processing!

This post is also available in Hungarian: Variációk egy témára
