Removing newlines from an RTF file using sed

I have an RTF file which is formatted like so:

Asked by: Guest | Views: 300
Total answers/comments: 1
Guest [Entry]

"The solution lay in a tool I hadn't given serious thought to: awk

awk 'BEGIN { FS="\\\\par" } ; /^ / {print "\\par" $1} /^[^ ]/ {print " " $1}'

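This goes over the file with \par as the field separator and prints only the first field, so the \par at the end of each line is dropped. It prints a \par before any line that starts with a space (four leading spaces mark the beginning of a new paragraph). For a line that starts with anything but a space, it prints no \par, only a single space followed by the text, so that words from adjacent lines do not run together.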
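To see what this awk step does on its own, here is a minimal sketch run against a made-up three-line sample. The sample text and the function names sample and mark_paragraphs are hypothetical, not taken from the original file.

```shell
# Hypothetical sample: every physical line ends in \par, but only the
# lines indented with spaces really begin a new paragraph.
sample() {
  printf '%s\n' \
    '    First paragraph starts here\par' \
    'and continues on this line.\par' \
    '    Second paragraph.\par'
}

# The awk command from above, wrapped in a function for reuse.
mark_paragraphs() {
  awk 'BEGIN { FS="\\\\par" } ; /^ / {print "\\par" $1} /^[^ ]/ {print " " $1}'
}

sample | mark_paragraphs
```

The first and third lines come out with a leading \par, while the second comes out with a leading space and no \par at all.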

Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

tr -d '\r\n'

Then feed the result to sed to replace each \par with \par\r\n, which adds a line break after every \par.

sed 's/\\par/\\par\r\n/g'

And done.
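Put together, the three steps could be chained into a single pipeline. This is only a sketch: rejoin_paragraphs, sample and the sample text are hypothetical names, and it assumes GNU sed, since \r and \n in the replacement are not understood by every sed.

```shell
# Hypothetical sample input with a rogue line break inside a paragraph.
sample() {
  printf '%s\n' \
    '    First paragraph starts here\par' \
    'and continues on this line.\par' \
    '    Second paragraph.\par'
}

# Step 1: keep \par only in front of indented (paragraph-starting) lines.
# Step 2: delete every CR and LF so the text becomes one long line.
# Step 3: put a CRLF after each remaining \par (GNU sed escapes assumed).
rejoin_paragraphs() {
  awk 'BEGIN { FS="\\\\par" } ; /^ / {print "\\par" $1} /^[^ ]/ {print " " $1}' |
    tr -d '\r\n' |
    sed 's/\\par/\\par\r\n/g'
}

sample | rejoin_paragraphs
```

On a real file this might be run as rejoin_paragraphs < input.rtf > output.rtf, where both file names are placeholders.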

The only real issue I've found with this method is that it ruined the RTF header. No problem, I just copied over the header from the original file.

Another, smaller issue was that chapter titles were printed inline with the preceding paragraph. This is because chapter titles do not start with a space, yet each should be treated as its own paragraph. In my case, chapters were marked like so:

CHAPTER THIRTY-TWO
Chapter's Name

So a quick sed took care of them:

sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'
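As a rough illustration of that last command, the sketch below feeds it one made-up line in which a chapter heading has been glued onto the end of the previous paragraph. The sample text and the names sample and split_chapters are hypothetical, and GNU sed is assumed for \s and for \r\n in the replacement.

```shell
# Hypothetical line as it might look after the earlier steps: the chapter
# heading and chapter name are stuck to the end of the previous paragraph.
sample() {
  printf '%s\n' '    End of the previous chapter. CHAPTER THIRTY-TWO Chapter Name\par'
}

# The chapter-fixing sed from above, wrapped in a function.
split_chapters() {
  sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'
}

sample | split_chapters
```

After this, the CHAPTER line and the chapter name each sit on a line of their own, separated from the previous paragraph by a few empty \par lines.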

I now have my book in proper format, which makes it readable on other devices (such as my iPod)."