26 September 2011

Incestuous Sed, or, 's_..*_s/\\"&\\"/\\\\\\"&\\\\\\"/g_g'

I have a CSV file to import; I have no control over the producer of this file, and its output is unfortunately non-conforming; it encloses every non-numeric field in double quotes, and fails to escape double quotes within the field. In other words, I have something like this:

1,"Foo","A "Lord of the Rings" expert","Blah",123.45

While it is possible to imagine a parser that might be able to cope with this, in my case I'm importing into MySQL ("load data infile ...") and MySQL has no plans to accommodate this kind of CSV right now today. In order to import this, I need to transform it to the following:

1,"Foo","A \"Lord of the Rings\" expert","Blah",123.45

One solution is to detect quotes that are not part of the pattern /","/, but that gets tricky for first, last, and numeric columns. Given that the data is finite and changes slowly, I decided to write down the expressions that needed fixing, and write a sed expression to fix them.

To fix the above, all I need is

cat data.csv | sed 's/\"Lord\ of\ the\ Rings\"/\\\"Lord\ of\ the\ Rings\\\"/g' > clean-data.csv

But given a list of expressions to fix, I don't want to go the error-prone way of typing out all these sed commands line-by-line, making sure to escape all the spaces and other special chars, and counting backslashes. What can I use to transform these expressions into sed commands? Why, sed, of course! Here's how I transform a list of expressions into a list of corresponding quote-escaping sed commands, for use immediately afterwards in the same script:

cat quoted_terms.txt | sed -e 's/( |\?)/\\\&/g' -e 's_.*_s/\\"&\\"/\\\\\\"&\\\\\\"/g_g' > clean.sed
  cat data.csv | sed -f clean.sed > clean-data.csv

Yes, it's Backslash Hell!! The first line transforms this:

Lord of the Rings
The Canterbury Tales
Is Anybody Home?

into this:

s/\"Lord\ of\ the\ Rings\"/\\\"Lord\ of\ the\ Rings\\\"/g
s/\"The\ Canterbury\ Tales\"/\\\"The\ Canterbury\ Tales\\\"/g
s/\"Is\ Anybody\ Home\?\"/\\\"Is\ Anybody\ Home\?\\\"/g

Then the second line transforms this:

1,"Foo","A "Lord of the Rings" expert","Blah",123.45
2,"Bar","Read all of "The Canterbury Tales"","Blah",234.56
4,"Titi","Asked "Is Anybody Home?"","Blah",456.78

into this:

1,"Foo","A \"Lord of the Rings\" expert","Blah",123.45
2,"Bar","Read all of \"The Canterbury Tales\"","Blah",234.56
4,"Titi","Asked \"Is Anybody Home?\"","Blah",456.78

And voilĂ , clean csv all ready for import ... all thanks to the power of sed to mate with itself and generate more sed ...

Now, you can go and enjoy Sed - An Introduction and Tutorial by Bruce Barnett, because I'm not going to try explain all those backslashes