When constructing file names or URLs, it’s often useful to “slugify” a string, so that it consists only of alphanumerics separated by dashes. For instance, you may have a string like this:
Linux clover 2.6.19-gentoo-r5 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz
It has uppercase and lowercase letters, digits, brackets… You need to remove everything but the alphanumerics while retaining readability. For instance, you may want:
linux-clover-2-6-19-gentoo-r5-i686-genuine-intel-r-cpu-t2050-1-60ghz
If you append “.html” to it, it makes a very nice URL, doesn’t it?
Here’s a part of a pipe chain that slugifies strings:
sed -e 's/[^[:alnum:]]/-/g' | tr -s '-' | tr A-Z a-z
If you have a shell script and you want to slugify variable content, you can:
SLUGIFIED="$(echo -n "${VARIABLE}" | sed -e 's/[^[:alnum:]]/-/g' | tr -s '-' | tr A-Z a-z)"
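If you use this in more than one place, it may be handy to wrap the pipeline in a function. Below is a minimal sketch; the function name slugify is my own choice, and the final sed step (trimming a leading or trailing dash, which the plain pipeline leaves behind when the input starts or ends with punctuation) is an addition, not part of the pipeline above.

```shell
#!/bin/sh
# slugify: hypothetical helper name, chosen for this illustration.
# Prints its first argument as a lowercase, dash-separated slug.
slugify() {
    printf '%s\n' "$1" \
        | sed -e 's/[^[:alnum:]]/-/g' \
        | tr -s '-' \
        | tr 'A-Z' 'a-z' \
        | sed -e 's/^-//' -e 's/-$//'   # extra step: trim edge dashes
}

slugify 'Linux clover 2.6.19-gentoo-r5 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz'
# prints: linux-clover-2-6-19-gentoo-r5-i686-genuine-intel-r-cpu-t2050-1-60ghz
```

In a script you would then write something like SLUG="$(slugify "${VARIABLE}")".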
Note that WordPress likes to mess up quotes. They are meant to be plain, double ones.
I fiddled a lot with my locale variables, but could get neither coreutils’ own tr nor Perl’s uc (and others) to correctly lowercase a string with Polish diacritics. However, Tcl’s puts [string tolower {STRING}] worked right out of the box. I guess their legendary Unicode support is a serious claim.
And while writing this, I just thought of Perl’s Text::Unaccent… Let’s see:
echo 'ZAŻÓŁĆ' | perl -MText::Unaccent -ne 'print(lc(unac_string("utf-8", "$_")))'
…seems to work just right.
This requires installing the Text::Unaccent module, either directly from CPAN or via your package manager. Either way, this solution will most probably not work with the basic Perl that comes with a default OS install.
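If a script depends on the module, it can check for it up front instead of failing halfway through. A small sketch, assuming perl is on the PATH; the helper name has_perl_module is made up for this example:

```shell
#!/bin/sh
# has_perl_module: hypothetical helper name, used only for this illustration.
# Succeeds (exit status 0) if Perl can load the named module.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if has_perl_module Text::Unaccent; then
    echo "Text::Unaccent is available"
else
    echo "Text::Unaccent is missing; install it from CPAN or your package manager" >&2
fi
```

The check simply asks Perl to load the module and run an empty program, so it also fails cleanly when Perl itself is not installed.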