SED and Unicode Quotation Marks

When testing against this string:

Asked by: Guest
Guest [Entry]

"What version of sed are you using? I believe that GNU sed should support Unicode characters, and your example works for me on Linux (Ubuntu, with UTF-8 environment).

If you are using a version of sed that is not Unicode-aware, your character group would break because it only matches one byte at a time. If your command line uses a UTF-8 encoding, then when you type “, a non-Unicode-aware sed actually sees three bytes: \xE2, \x80 and \x9C. That breaks your character group, which can only match one of those bytes at a time. Various other constructs would fail too, e.g. a”? is the letter ‘a’, then two bytes, followed by an optional third byte, so a on its own wouldn't match the expression even though it looks like it should.
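To make the byte-level behaviour concrete, here is a rough sketch in Python rather than sed. It assumes that Python's `re` module with `bytes` patterns is a fair stand-in for a sed that is not Unicode-aware, and that `str` patterns stand in for a Unicode-aware one. The sample text and variable names are made up for illustration.

```python
import re

LEFT, RIGHT = "\u201c", "\u201d"  # the curly quotes “ and ”

# Each curly quote is three bytes in UTF-8.
left_bytes = LEFT.encode("utf-8")    # E2 80 9C
right_bytes = RIGHT.encode("utf-8")  # E2 80 9D

text = LEFT + "hi" + RIGHT  # hypothetical sample input

# Character-aware matching: the class matches whole code points,
# so each curly quote becomes exactly one straight quote.
char_result = re.sub("[" + LEFT + RIGHT + "]", '"', text)

# Byte-oriented matching: the class is just a set of single bytes,
# so every byte of each curly quote is replaced separately.
byte_class = b"[" + left_bytes + right_bytes + b"]"
byte_result = re.sub(byte_class, b'"', text.encode("utf-8"))

# a”? at byte level: only the last byte of the quote is optional,
# so a bare "a" no longer matches.
optional_quote = b"a" + right_bytes + b"?"
bare_a_matches_bytes = re.fullmatch(optional_quote, b"a") is not None
bare_a_matches_chars = re.fullmatch("a" + RIGHT + "?", "a") is not None

print(char_result)           # "hi"
print(byte_result)           # b'"""hi"""'
print(bare_a_matches_bytes)  # False
print(bare_a_matches_chars)  # True
```

In the byte-oriented case each curly quote turns into three straight quotes, which is the kind of breakage to look for if your sed is treating the input as raw bytes.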

(You might want to consider also replacing the ellipsis character with three periods. Ellipsis is a compatibility character in Unicode; it's generally considered more modern to write out the periods and let the font take care of the typesetting.)"
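As an illustration of the ellipsis point, the following Python sketch checks the character's compatibility decomposition and expands it to three periods. The helper name `expand_ellipsis` and the sample string are hypothetical. In sed itself the rough equivalent would be a plain literal substitution such as `s/…/.../g`.

```python
import unicodedata

ELLIPSIS = "\u2026"  # the single-character ellipsis …

# Unicode records a compatibility decomposition to three full stops.
decomposition = unicodedata.decomposition(ELLIPSIS)


def expand_ellipsis(text: str) -> str:
    """Replace each ellipsis character with three ASCII periods."""
    return text.replace(ELLIPSIS, "...")


sample = "Wait" + ELLIPSIS + " what?"  # hypothetical sample input
expanded = expand_ellipsis(sample)

# NFKC normalization applies the same decomposition, but it also
# rewrites every other compatibility character in the text.
normalized = unicodedata.normalize("NFKC", sample)

print(decomposition)  # <compat> 002E 002E 002E
print(expanded)       # Wait... what?
print(normalized)     # Wait... what?
```

The targeted `replace` is probably the safer choice if the ellipsis is the only character you want to change, since NFKC touches a lot more than that.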