Weird sed behavior when matching capital letters

Gentoo

maciej@matilda ~ $ cat /etc/gentoo-release

Gentoo Base System version 1.6.14

maciej@matilda ~ $ sed --version

GNU sed version 4.1.4

maciej@matilda ~ $ echo ' <Field name="hospital_id">VHI</Field>' | sed -e 's/^\(.*\)\([A-Z]\{3\}\)/\2/'

VHI</Field>

Debian

krok:~# cat /etc/debian_version

3.1

krok:~# sed --version

GNU sed version 4.1.2

krok:~# echo ' <Field name="hospital_id">VHI</Field>' | sed -e 's/^\(.*\)\([A-Z]\{3\}\)/\2/'

eld>

Difference

So Gentoo sed 4.1.4 returns “VHI</Field>” while Debian sed 4.1.2 returns “eld>”. How is that?

Is it the sed itself or some underlying regular expression library?

Solution

Don't use [A-Z]. Use [[:upper:]].

The problem was caused by the locale. The script was originally written under the POSIX locale and failed under pl_PL.

debian:~# export LC_ALL=POSIX
debian:~# echo abcdz | sed 's/[A-Z]/x/g'
abcdz
debian:~# export LC_ALL=pl_PL
debian:~# echo abcdz | sed 's/[A-Z]/x/g'
xxxxz

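It looks like the collation order under pl_PL is quite different from the one under the C or POSIX locale: there the range [A-Z] also takes in lowercase letters. That would explain the Debian result: with lowercase letters matching too, the greedy .* runs on to the last three letters of the line, “eld”.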
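
A quick way to see this on a given machine is to ask sed which lowercase letters the range lets through. The following is a minimal sketch, not part of the original script: the function name leaked_lowercase is made up for illustration, and it assumes the named locale is installed (if it is not, sed typically falls back to the C locale and prints an empty line). What it prints for pl_PL depends on the sed and libc versions.

```shell
#!/bin/sh
# leaked_lowercase LOCALE (hypothetical helper, for illustration only)
# Prints the lowercase ASCII letters that [A-Z] matches under LOCALE.
# An empty line means the range covers capitals only.
leaked_lowercase() {
    # Delete every character that is NOT in the range; what survives matched.
    printf '%s\n' abcdefghijklmnopqrstuvwxyz | LC_ALL="$1" sed 's/[^A-Z]//g'
}

leaked_lowercase C        # empty line: no lowercase letter is in the range
leaked_lowercase pl_PL    # result varies with the system's locale data
```

If the second call prints any letters, every [A-Z] in a script run under that locale matches more than capitals.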

Conclusion: don't use [A-Z] to match capital letters.
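
Applied to the command from the top of the post, the fix might look like the sketch below. It shows two options: the [[:upper:]] class, and, as an alternative not mentioned above, keeping [A-Z] while pinning the locale to C for that one command. The variable names are only for illustration.

```shell
#!/bin/sh
# Sample line from the post.
line=' <Field name="hospital_id">VHI</Field>'

# Option 1: a POSIX character class, which does not depend on collation order.
with_class=$(printf '%s\n' "$line" | sed -e 's/^\(.*\)\([[:upper:]]\{3\}\)/\2/')

# Option 2: keep [A-Z], but force the C locale for this single sed call.
with_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/^\(.*\)\([A-Z]\{3\}\)/\2/')

printf '%s\n' "$with_class"      # VHI</Field>
printf '%s\n' "$with_c_locale"   # VHI</Field>
```

Option 1 is the more portable choice, since it keeps working no matter which locale the caller has set.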

Author: automatthias