Chinese file name and locked file
Post Chinese file name and locked file 
Hi,


I'm using BackupPC-2.0.2. I had a similar issue in Japanese, and I
added some changes.

1. smb.conf

[global]
dos charset = CP932
unix charset = EUCJP-MS
display charset = CP932


2. $Conf{CgiHeaders}

<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="Content-Type" content="text/html;
charset=$Conf{Charset}">


3. EscHTML in BackupPC_Admin

*** 1614,1620 ****
  $s =~ s/\"/&quot;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/</&lt;/g;
! $s =~ s{([^[:print:]])}{sprintf("&\#x%02X;", ord($1));}eg;
  return \$s;
  }

--- 1645,1658 ----
  $s =~ s/\"/&quot;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/</&lt;/g;
! if ($Conf{Language} eq 'ja') {
!     $bpc->jconvert(\$s, $Conf{Charset}, $Conf{FsCharset});
!     $s =~ s{([\x00-\x1f\x7f])}{sprintf("&\#x%02X;", ord($1));}eg;
! } else {
!     $s =~ s{([^[:print:]])}{sprintf("&\#x%02X;", ord($1));}eg;
! }
  return \$s;
  }

Furthermore, I changed Action_restoreFile in BackupPC_Admin, and the
commands BackupPC_tarCreate and BackupPC_zipCreate, for restoring files
and downloading.

Koichi




-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
BackupPC-users mailing list
BackupPC-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/backuppc-users
http://backuppc.sourceforge.net/

Post Chinese file name and locked file 
"Koichi Kubo" writes:

> I'm using BackupPC-2.0.2. I had a similar issue in Japanese, and I
> added some changes.
>
> 1. smb.conf
>
> [global]
> dos charset = CP932
> unix charset = EUCJP-MS
> display charset = CP932
>
> 2. $Conf{CgiHeaders}
>
> <meta http-equiv="pragma" content="no-cache">
> <meta http-equiv="Content-Type" content="text/html;
> charset=$Conf{Charset}">

It looks like you added a new config variable, $Conf{Charset}.
What is the default value, and what value do you use for
Japanese?

> 3. EscHTML in BackupPC_Admin
>
> *** 1614,1620 ****
>   $s =~ s/\"/&quot;/g;
>   $s =~ s/>/&gt;/g;
>   $s =~ s/</&lt;/g;
> ! $s =~ s{([^[:print:]])}{sprintf("&\#x%02X;", ord($1));}eg;
>   return \$s;
>   }
>
> --- 1645,1658 ----
>   $s =~ s/\"/&quot;/g;
>   $s =~ s/>/&gt;/g;
>   $s =~ s/</&lt;/g;
> ! if ($Conf{Language} eq 'ja') {
> !     $bpc->jconvert(\$s, $Conf{Charset}, $Conf{FsCharset});
> !     $s =~ s{([\x00-\x1f\x7f])}{sprintf("&\#x%02X;", ord($1));}eg;
> ! } else {
> !     $s =~ s{([^[:print:]])}{sprintf("&\#x%02X;", ord($1));}eg;
> ! }
>   return \$s;
>   }

What is the value of $Conf{FsCharset} and what does the subroutine
in $bpc->jconvert() do? How can this be generalized to other
charsets?

> Furthermore, I changed Action_restoreFile in BackupPC_Admin, and the
> commands BackupPC_tarCreate and BackupPC_zipCreate, for restoring files
> and downloading.

What changes did you make to these files?

It looks like you also have a Japanese translation. If so, it would
be great if you could provide diffs against BackupPC-2.1.0beta0.

Craig



Post Chinese file name and locked file 
I have for years been responsible for maintaining both Japanese and English
web sites, and doing web site development in PHP. We'll be expanding into
Korean and Chinese language web sites also, so have been looking into what
that will require.

As a native English speaker, and someone who cannot read Japanese, dealing
with Asian language encodings has been an education and a challenge. I
create the structure of the web site, and native Japanese speakers localize
the text of the web site.

The predominant Japanese encoding is shift-jis, which encodes Japanese
characters as two bytes. The challenge, programming-wise, is that for
several common Japanese characters, the second of the two bytes is a
backslash (\).

This creates all kinds of grief. When handling strings, you sometimes have
to escape this second byte with another backslash, sometimes not, and of
course keep this all straight. This is just the most obvious problem that
comes to mind at the moment.
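
To make the embedded-backslash problem concrete, here is a minimal sketch in Python (used only for illustration; the code in this thread is Perl). It assumes Python's built-in `shift_jis` codec and uses the kanji 表 as a sample character; the helper name `has_hidden_backslash` is hypothetical.

```python
# Sketch: detect characters whose Shift_JIS encoding ends in 0x5C,
# the same byte value as an ASCII backslash.
BACKSLASH = 0x5C

def has_hidden_backslash(ch: str) -> bool:
    """Return True if the Shift_JIS form of `ch` is two bytes ending in 0x5C."""
    raw = ch.encode("shift_jis")
    return len(raw) == 2 and raw[1] == BACKSLASH

sample = "表"                      # sample character chosen for this example
raw = sample.encode("shift_jis")   # two bytes: 0x95 0x5C

# A byte-oriented escaper that doubles every backslash damages the character:
naive = raw.replace(b"\\", b"\\\\")

print(raw.hex(), has_hidden_backslash(sample))
print(naive.hex())
```

The doubled byte in `naive` no longer decodes to the original character, which is the kind of grief described above.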

And, I must "know" that the data in the database is stored in shift-jis
encoding, rather than euc-jp, or iso-2022-jp, which are other possible
Japanese encodings. The iso-2022-jp encoding is used in email, so creating
email messages from information in a database involves converting data from
one encoding to another.
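
As a rough illustration of that conversion step, the following Python sketch re-encodes a Shift_JIS value as ISO-2022-JP. The function name `transcode` and the sample string are assumptions made for this example; only Python's standard codecs are used.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from charset `src` and re-encode it as charset `dst`."""
    return data.decode(src).encode(dst)

# Pretend this value came out of a database that stores Shift_JIS.
original_text = "日本語のファイル"
stored = original_text.encode("shift_jis")

mail_body = transcode(stored, "shift_jis", "iso2022_jp")

# ISO-2022-JP is a 7-bit encoding that switches character sets with
# escape sequences, so every byte stays below 0x80.
print(mail_body.startswith(b"\x1b$B"), max(mail_body) < 0x80)
```

The key point is that the caller has to know the source charset; nothing in the bytes themselves says "this is Shift_JIS".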

But, essentially, everything is in shift-jis encoding. And that appears to
be how most people have dealt with localization issues with BackupPC, and
their data in general. It seems that people create systems that can handle
English, and one other language encoding. Certainly, that is what I did.

But, now I'm going to also have to deal with multiple foreign languages
across several web sites. I doubted my ability to keep everything
straight, and it occurred to me that I might need to keep data from several
web sites in a single database table. Which encoding should *that* table
be in?

I have been investigating alternatives and have been switching web sites to
utf-8. The utf-8 encoding is a method for encoding unicode characters in
an 8-bit byte stream. Unicode allows encoding of all languages
simultaneously, and so you do not need to "know" what language you are
dealing with.

Many people think unicode means 16-bit characters, but that is just one way
of encoding the unicode character set. utf-8 is another encoding for
unicode characters, and it allows you to work with systems like legacy Unix
systems or database systems, that are 8-bit character oriented. Accented
Latin characters usually take two bytes, Asian characters usually take
three bytes. A huge advantage of utf-8 is that all multi-byte sequences
have the high-order bit turned on for each byte. This means that there are
never any "special" characters (like backslash) embedded in the middle of a
multi-byte sequence.
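
A short Python sketch can check both claims (byte counts and the high-order bit). The helper name `utf8_profile` and the sample characters are assumptions chosen for this example.

```python
def utf8_profile(ch: str):
    """Return (byte_count, every_byte_has_high_bit) for a single character."""
    raw = ch.encode("utf-8")
    return len(raw), all(b >= 0x80 for b in raw)

# ASCII letter, accented Latin letter, Japanese kanji, Korean syllable.
samples = ["A", "\u00e9", "日", "한"]

for ch in samples:
    print(repr(ch), utf8_profile(ch))
```

Because every byte of a multi-byte sequence is 0x80 or above, a byte such as 0x5C (backslash) or 0x2F (slash) in utf-8 data is always a real backslash or slash.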

What does that have to do with BackupPC? This could be used at several levels.

One is the CGI interface. It has been localized for several
languages. All the translations could be provided (or converted) to
utf-8. Then the CGI code could always emit utf-8 encoded web pages, **and
not have to know anything about the language it is in**.

The other issue is file names. Although it would be nice to force everyone
else to use utf-8 on their machines because it is convenient for *you*,
that is not realistic. But if you knew the encoding of the client that you
were backing up, you could convert the encoding of the filename to/from
utf-8 as you were communicating with that client.

Thus, a new option that declared the charset of a client machine would
allow BackupPC to operate in a language-neutral way, changing the filename
encoding only when it talked to each client.
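
The idea might be sketched like this in Python. Everything here is hypothetical: the `CLIENT_CHARSET` table stands in for the proposed per-client option (no such BackupPC setting is described in this thread), and the host names are invented.

```python
# Hypothetical per-host charset declarations (not an existing BackupPC option).
CLIENT_CHARSET = {"tokyo-pc": "cp932", "paris-pc": "cp1252"}

def name_from_client(host: str, raw: bytes) -> str:
    """Decode a file name received from `host` into a Unicode string."""
    return raw.decode(CLIENT_CHARSET.get(host, "utf-8"))

def name_to_client(host: str, name: str) -> bytes:
    """Encode a Unicode file name in the charset that `host` expects."""
    return name.encode(CLIENT_CHARSET.get(host, "utf-8"))

# Backup: bytes arrive in the client's charset and are stored as utf-8.
wire_bytes = "報告書.txt".encode("cp932")
stored = name_from_client("tokyo-pc", wire_bytes).encode("utf-8")

# Restore: the stored utf-8 name is converted back for the same client.
restored = name_to_client("tokyo-pc", stored.decode("utf-8"))
print(restored == wire_bytes)
```

Only the two boundary functions know about client charsets; everything between them works on one encoding.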

This may seem like a lot of work, but I suspect it is concentrated in just
a few places. And, from my experience, it greatly simplifies programs that
try to deal sanely with more than one encoding at a time.

Right now it seems everyone adds a few tweaks to get BackupPC to work for
their language, like adding just the right thing to $Conf{CgiHeaders}.
These tweaks are specific to each of the different European languages.

But as you get further from European languages, the problems increase (like
embedded backslashes in multi-byte sequences). The changes that Koichi
Kubo made for Japanese and Chinese hint at the additional
complexity. Standardizing on utf-8 internally seems like it could
eliminate a lot of these problems, make it easier for all non-English
speakers to use BackupPC, and have a standard BackupPC code base that
automatically supports all languages.


Marlin Prowell
Cadalog, Inc.





Post Chinese file name and locked file 
Marlin,

Thanks for the tutorial! I really need to understand this better
so we can provide better i18n support.

Yes, the current CGI translations are Western European, which
still use one byte per character.

There are three relevant i18n areas in BackupPC:

1. translation of the CGI text,

2. correct handling of file names to/from the XferMethod,

3. correct display of the file names in the CGI interface and
restore file.

For 1, translation probably should move towards utf-8 as you suggest.

For 2, currently BackupPC simply treats arriving file names as
sequences of bytes and shouldn't interpret them in any special
way. It does "mangle" the files in case they include special
characters like \r, \n or /. In theory, the same sequence of
bytes should be delivered back to the XferMethod during a
restore. Doug observed problems with smbclient where, internal
to smbclient, it was not reading files correctly. I assume an
appropriate codepage or charset setting would solve this.
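
For readers unfamiliar with the idea, here is a minimal Python sketch of a reversible byte-level mangling. It is an illustrative percent-escaping scheme under my own assumptions, not a copy of BackupPC's actual mangling code; the names `mangle` and `unmangle` are hypothetical.

```python
import re

_SPECIAL = re.compile(rb"[%/\r\n]")          # bytes that must not appear as-is
_ESCAPED = re.compile(rb"%([0-9a-f]{2})")    # what mangle() produces

def mangle(name: bytes) -> bytes:
    """Percent-escape bytes that are unsafe in a single path component."""
    return _SPECIAL.sub(lambda m: b"%%%02x" % m.group(0)[0], name)

def unmangle(stored: bytes) -> bytes:
    """Reverse mangle(), returning the original byte sequence."""
    return _ESCAPED.sub(lambda m: bytes([int(m.group(1), 16)]), stored)

original = b"a/b\r\n100%\x95\x5c.txt"   # includes two arbitrary non-ASCII-safe bytes
stored = mangle(original)
print(stored, unmangle(stored) == original)
```

Because the functions work on bytes and never decode them, the sequence handed back at restore time is the same one that arrived, whatever charset the client used.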

The last area is where BackupPC is the weakest. It simply
emits the sequence of bytes as the file name, after doing
the usual html escapes (eg: <>&") and using the &#xx hex
form for non-printable characters. For western charsets
this will often work ok, but it won't work when the XferMethod
and html have different charsets.
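
A small Python sketch of the mismatch, and of one possible remedy (decode with the source charset, escape, then emit in the page charset). The function name `filename_to_html` and the sample file name are assumptions made for this example.

```python
import html

def filename_to_html(raw: bytes, source_charset: str) -> bytes:
    """Decode `raw` from its source charset, HTML-escape it, emit utf-8."""
    text = raw.decode(source_charset)
    return html.escape(text, quote=True).encode("utf-8")

# Bytes as a Latin-1 client might have sent them.
raw = "r\u00e9sum\u00e9 <v2>.doc".encode("latin-1")

# Dropping the raw bytes into a utf-8 page gives an invalid byte sequence:
try:
    raw.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

page_bytes = filename_to_html(raw, "latin-1")
print(valid_as_utf8, page_bytes.decode("utf-8"))
```

The conversion only works if the source charset is known, which is exactly the information the byte-oriented approach throws away.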

Craig



Post Chinese file name and locked file 
At 10:52 PM 3/25/2004 -0800, Craig Barratt wrote:
> There are three relevant i18n areas in BackupPC:
>
> 1. translation of the CGI text,
>
> 2. correct handling of file names to/from the XferMethod,
>
> 3. correct display of the file names in the CGI interface and
> restore file.
>
> For 1, translation probably should move towards utf-8 as you suggest.

Yes, but see below.

> For 2, currently BackupPC simply treats arriving file names as
> sequences of bytes and shouldn't interpret them in any special
> way. It does "mangle" the files in case they include special
> characters like \r, \n or /. In theory, the same sequence of
> bytes should be delivered back to the XferMethod during a
> restore. Doug observed problems with smbclient where, internal
> to smbclient, it was not reading files correctly. I assume an
> appropriate codepage or charset setting would solve this.

Using utf-8 would insert an additional step - first converting from the
"native" charset of the client, and then applying the "mangling". Now, I
suppose that it is possible for filenames to contain \r or \n, and using
utf-8 won't change that. In the utf-8 encoding, all characters in the
range 0-127 keep their "traditional" meaning, just as in western European
character sets. All other characters are transformed into a multi-byte
sequence, with each byte >= 128. Thus, there are never hidden \r, \n, /,
or \\ characters inside a multi-byte sequence. Believe me, if you have not
dealt with that yet, you don't want to start. It will be much easier to
move to utf-8 instead.

I haven't done much in perl for charset conversions, but I believe that
Text::Iconv is the right tool here for translating to and from the native
charset. I use iconv() a lot when converting charsets.

> The last area is where BackupPC is the weakest. It simply
> emits the sequence of bytes as the file name, after doing
> the usual html escapes (eg: <>&") and using the &#xx hex
> form for non-printable characters. For western charsets
> this will often work ok, but it won't work when the XferMethod
> and html have different charsets.

Or, if you are dealing with a multi-byte character set, perhaps shift-jis,
you have to be careful *not* to escape the second byte in a multi-byte
sequence, even if it is a backslash. So, with multi-byte encodings, you
cannot examine single bytes out of context and turn them into &#xx. For
example, in PHP, you have to replace all calls to string handling routines
to multi-byte aware string handling routines. (Actually, that is not
literally true, there are ways to fake this in PHP, but I think it is
better for the developer to be explicitly aware of this, and to explicitly
handle it.) Using utf-8 allows you to ignore these problems.
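
The difference between escaping bytes out of context and escaping decoded characters can be sketched in Python as follows. Both function names are hypothetical, and the sample name uses the kanji 表, whose shift-jis form ends in a backslash byte.

```python
import re

def escape_bytewise(raw: bytes) -> str:
    """Escape every byte outside printable ASCII as &#xNN;, ignoring context."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "&#x%02X;" % b for b in raw)

def escape_charwise(raw: bytes, charset: str) -> str:
    """Decode first, then escape only genuine control characters."""
    text = raw.decode(charset)
    return re.sub(r"[\x00-\x1f\x7f]",
                  lambda m: "&#x%02X;" % ord(m.group(0)), text)

raw = "表.txt".encode("shift_jis")    # bytes 0x95 0x5C followed by ".txt"

print(escape_bytewise(raw))               # the character is split in two
print(escape_charwise(raw, "shift_jis"))  # the character survives intact
```

The byte-wise version turns the lead byte into an entity and leaves a stray backslash behind, so the browser can no longer reassemble the character.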

There are other benefits to using utf-8 internally. You could restore
files to a machine that has a different encoding than the original
client. For example, you could restore Swedish Windows files to a Unix
Samba 3.0 server. Samba 3.0 uses unicode internally, not the Swedish
codepage. As long as you know the "native" charset of the target, you can
transfer files from a machine with one encoding to a machine with a
different encoding. Within reason. I don't think transferring Chinese
files to English Windows 95 will be very successful.

One last point. All three areas must be converted from native charset to
utf-8 simultaneously. It won't work to just convert the string
translations to utf-8 unless everything else you emit for a web page (like
a file name) is also in utf-8 encoding.

After saying all this, I'll admit that I haven't looked at BackupPC yet to
see what exactly needs to be done. Nor have I thought much about how you
get from native file names to utf-8 file names on the BackupPC server when
converting from one system to another. But I am interested in the problem,
and will look into it soon. And I have done some work with converting
directory hierarchies from one encoding to another, and have a perl script
that can do that, so I may have some useful pieces lying about.

We can also take this conversation off-line so we don't clutter up the
mailing list.


Marlin Prowell
Cadalog, Inc.

Post Chinese file name and locked file 
Marlin Prowell writes:

[snip]

Thanks for the excellent education.

Native utf-8 support is now on the todo list. I agree that file names,
internals, and html should be in utf-8. Perl's native support in 5.8.x
should make this relatively painless. I do need to verify whether you
can reliably create a utf-8 file name (with arbitrary bytes 0x80-0xff)
on a typical *nix file system, or whether mangling needs to protect more
of the bytes (I don't care as much about the file names looking sensible
when a user looks at the backup directories manually). I also need
to understand what charsets smbclient, tar and rsync use, and how
you set them.
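
One way to run that kind of check is sketched below in Python. It is only a probe under stated assumptions: the helper name `roundtrips_on_disk` is hypothetical, it tests whichever file system holds the temporary directory, and the result will vary between systems.

```python
import os
import tempfile

def roundtrips_on_disk(name: bytes) -> bool:
    """Create a file called `name` in a temp dir and check the listing returns it."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)              # work with byte paths throughout
        try:
            with open(os.path.join(tmp_b, name), "wb"):
                pass
        except OSError:
            return False                      # the file system rejected the name
        return name in os.listdir(tmp_b)

utf8_result = roundtrips_on_disk("日本語.txt".encode("utf-8"))
print(utf8_result)
```

Using byte paths avoids any implicit conversion by Python, so the comparison is between the exact bytes written and the exact bytes the directory listing reports.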

> We can also take this conversation off-line so we don't clutter up the
> mailing list.

Let's move the discussion to backuppc-devel < at > lists.sourceforge.net.

I'll take this up after 2.1.0 is done.

Craig


