For years I have been responsible for maintaining both Japanese and English
web sites, and for doing web site development in PHP. We'll be expanding into
Korean and Chinese language web sites as well, so I have been looking into
what that will require.
As a native English speaker, and someone who cannot read Japanese, dealing
with Asian language encodings has been an education and a challenge. I
create the structure of the web site, and native Japanese speakers localize
the text of the web site.
The predominant Japanese encoding is shift-jis, which encodes most Japanese
characters as two bytes. The challenge, programming-wise, is that for
several common Japanese characters, the second of the two bytes is 0x5C,
the same value as an ASCII backslash (\).
This creates all kinds of grief. When handling strings, you sometimes have
to escape this second byte with another backslash, sometimes not, and of
course keep this all straight. This is just the most obvious problem that
comes to mind at the moment.
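
To make the backslash problem concrete, here is a minimal Python sketch (not
code from any of my sites; the function names are made up for illustration).
It uses Python's standard shift_jis codec to find characters whose second
byte is 0x5C, and shows how a byte-level escaping routine that knows nothing
about the encoding damages them.

```python
BACKSLASH = 0x5C  # ASCII backslash


def has_embedded_backslash(ch: str) -> bool:
    """True if the two-byte shift-jis form of ch ends in 0x5C."""
    encoded = ch.encode("shift_jis")
    return len(encoded) == 2 and encoded[1] == BACKSLASH


def naive_escape(raw: bytes) -> bytes:
    """Double every backslash byte, ignoring character boundaries."""
    return raw.replace(b"\\", b"\\\\")


if __name__ == "__main__":
    for ch in ["ソ", "表", "能", "あ"]:
        print(ch, ch.encode("shift_jis").hex(), has_embedded_backslash(ch))

    raw = "表".encode("shift_jis")
    # The escaper inserts a byte into the middle of a single character.
    print(len(raw), len(naive_escape(raw)))
```

The first three characters are affected and the fourth is not, which is why
the bug shows up only with certain words.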
And, I must "know" that the data in the database is stored in shift-jis
encoding, rather than euc-jp, or iso-2022-jp, which are other possible
Japanese encodings. The iso-2022-jp encoding is used in email, so creating
email messages from information in a database involves converting data from
one encoding to another.
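
As a rough illustration of that conversion step, assuming the database hands
back raw shift-jis bytes (the function name here is hypothetical), the
re-encoding for email could look like this in Python:

```python
def sjis_to_mail_bytes(stored: bytes) -> bytes:
    """Re-encode shift-jis bytes from the database as iso-2022-jp for email."""
    text = stored.decode("shift_jis")  # must *know* the source encoding
    return text.encode("iso2022_jp")   # 7-bit encoding using escape sequences


if __name__ == "__main__":
    stored = "日本語".encode("shift_jis")  # stand-in for a database value
    mail = sjis_to_mail_bytes(stored)
    print(mail)
    print(mail.decode("iso2022_jp"))
```

Note that the decode step only works because the source encoding is known in
advance; nothing in the bytes themselves says "this is shift-jis".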
But, essentially, everything is in shift-jis encoding. And that appears to
be how most people have dealt with localization issues with BackupPC, and
their data in general. It seems that people create systems that can handle
English, and one other language encoding. Certainly, that is what I did.
But, now I'm going to also have to deal with multiple foreign languages
across several web sites. I doubted my ability to keep everything
straight, and it occurred to me that I might need to keep data from several
web sites in a single database table. Which encoding should *that* table
be in?
I have been investigating alternatives and have been switching web sites to
utf-8. The utf-8 encoding is a method for encoding unicode characters in
an 8-bit byte stream. Unicode allows encoding of all languages
simultaneously, and so you do not need to "know" what language you are
dealing with.
Many people think unicode means 16-bit characters, but that is just one way
of encoding the unicode character set. utf-8 is another encoding for
unicode characters, and it lets you work with systems, such as legacy Unix
systems or database systems, that are oriented around 8-bit characters.
ASCII characters stay one byte each, accented Latin characters usually take
two bytes, and Asian characters usually take three bytes. A huge advantage
of utf-8 is that every byte of a multi-byte sequence has its high-order bit
turned on. This means that an ASCII "special" character (like backslash)
can never appear in the middle of a multi-byte sequence.
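
These properties are easy to check. The following Python sketch (the helper
name is my own invention) reports the utf-8 length of a character and whether
every byte has the high-order bit set:

```python
def utf8_profile(ch: str) -> tuple[int, bool]:
    """Return (byte length, True if every byte has the high bit set)."""
    data = ch.encode("utf-8")
    return len(data), all(b & 0x80 for b in data)


if __name__ == "__main__":
    for ch in ["A", "é", "日", "表"]:
        print(ch, ch.encode("utf-8").hex(), utf8_profile(ch))

    # The character that hides a backslash in shift-jis has none in utf-8.
    print(b"\\" in "表".encode("shift_jis"), b"\\" in "表".encode("utf-8"))
```

Only the plain ASCII character comes back with the high bit clear, so a byte
equal to 0x5C in utf-8 data is always a real backslash.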
What does that have to do with BackupPC? utf-8 could help at two levels.
One is the CGI interface. It has been localized for several
languages. All the translations could be provided in (or converted to)
utf-8. Then the CGI code could always emit utf-8 encoded web pages, **and
not have to know anything about the language it is in**.
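
Here is a sketch of the idea in Python. This is not BackupPC code (BackupPC
is written in Perl), and the translation table and function name are invented
for illustration. The point is that the page-building code is identical for
every language, because the header always declares utf-8:

```python
# Hypothetical translation table; every entry is ordinary unicode text.
TRANSLATIONS = {
    "en": "Backup summary",
    "ja": "バックアップの概要",
    "ko": "백업 요약",
}


def render_page(lang: str) -> bytes:
    """Build a CGI response without any language-specific handling."""
    title = TRANSLATIONS[lang]
    html = (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1></body></html>"
    )
    header = "Content-Type: text/html; charset=utf-8\r\n\r\n"
    return (header + html).encode("utf-8")


if __name__ == "__main__":
    for lang in TRANSLATIONS:
        print(lang, len(render_page(lang)), "bytes")
```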
The other issue is file names. Although it would be nice to force everyone
else to use utf-8 on their machines because it is convenient for *you*,
that is not realistic. But if you knew the encoding of the client that you
were backing up, you could convert the encoding of the filename to/from
utf-8 as you were communicating with that client.
Thus, a new option that declared the charset of a client machine would
allow BackupPC to operate in a language-neutral way, changing the filename
encoding only when it talked to each client.
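
A minimal sketch of what such an option might do, again in Python rather than
BackupPC's Perl. The per-client charset table, the host names, and both
function names are hypothetical; no such option exists in BackupPC as far as
I know.

```python
# Hypothetical per-client setting: the charset each client uses for file names.
CLIENT_CHARSET = {
    "tokyo-pc": "shift_jis",
    "paris-pc": "latin-1",
    "linux-box": "utf-8",
}


def name_from_client(host: str, raw: bytes) -> bytes:
    """Convert a file name received from a client into utf-8 for storage."""
    return raw.decode(CLIENT_CHARSET[host]).encode("utf-8")


def name_to_client(host: str, stored: bytes) -> bytes:
    """Convert a stored utf-8 file name back to the client's own charset."""
    return stored.decode("utf-8").encode(CLIENT_CHARSET[host])


if __name__ == "__main__":
    raw = "表.txt".encode("shift_jis")       # as sent by the Japanese client
    stored = name_from_client("tokyo-pc", raw)
    print(raw, stored)
    print(name_to_client("tokyo-pc", stored) == raw)
```

Everything between the two conversion functions sees only utf-8, so the rest
of the program never needs to ask which language a file name is in.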
This may seem like a lot of work, but I suspect it is concentrated in just
a few places. And, from my experience, it greatly simplifies programs that
try to deal sanely with more than one encoding at a time.
Right now it seems everyone adds a few tweaks to get BackupPC to work for
their language, like adding just the right thing to $Conf{CgiHeaders}.
These tweaks are specific to each of the different European languages.
But as you get further from European languages, the problems increase (like
embedded backslashes in multi-byte sequences). The changes that Koichi
Kubo made for Japanese and Chinese hint at the additional
complexity. Standardizing on utf-8 internally seems like it could
eliminate a lot of these problems, make it easier for all non-English
speakers to use BackupPC, and yield a single BackupPC code base that
automatically supports all languages.
Marlin Prowell
Cadalog, Inc.