
File Read Spanish characters

Author
9 Dec 2005 9:08 PM
Chip
There is surprisingly little information on the various encoding options for
reading a text file. I have what seems to be a very basic issue: I'm reading
a text file that includes Spanish characters such as "ñ". When I read the
file into a string, that character is missing. Encoding seems to be the
culprit. File writers SHOULD begin a file with the BOM (Byte Order Mark) to
let us know what encoding to read the file with, but most software doesn't
do this, so we are left with BOM-less files. So how can we reliably read these
files without knowing what encoding they were written with?

Through trial and error I have found that using UTF-7 picks up these Spanish
characters, along with the English.
Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

Since I am clueless on matters of encoding, my question is: am I safe using
UTF-7 if I only care about English and Spanish? What is the downside? I
won't be able to read Romanian? Japanese?

Is there a way to programmatically find the correct encoding without the BOM?

Chip

Author
12 Dec 2005 8:08 PM
Joerg Jooss
Chip wrote:

> There is surprisingly little information on the various encoding
> options for reading a text file. I have what seems to be a very basic
> issue: I'm reading a text file that includes Spanish characters such
> as "ñ". When I read the file into a string, that character is
> missing. Encoding seems to be the culprit. File writers SHOULD begin
> a file with the BOM (Byte Order Mark) to let us know what encoding to
> read the file with, but most software doesn't do this so we are left
> with BOMless files.

Remember that these are byte order marks, which are intended to be used
for identifying whether an encoding uses Big Endian or Little Endian
representation. The fact that some encodings can be identified by their
BOM is just a nice side effect.
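
As a rough illustration of that side effect, here is a minimal sketch (in Python rather than the thread's VB.NET) that checks the first bytes of a buffer against the well-known BOMs. The function name `sniff_bom` and the `BOMS` table are made up for this example; only the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Longer BOMs first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A BOM-less file such as the one described in the question falls through to `(None, 0)`, which is exactly the case where the BOM cannot help.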

> So how can we reliably read these files without
> knowing what encoding it was written with?

Only through application-specific metadata (like HTTP headers).
There's no grand universal scheme to tell a file's character encoding.

> Through trial and error I have found that using UTF-7 picks up these
> Spanish characters, along with the English.  Dim Reader As New
> StreamReader(fs, System.Text.Encoding.UTF7).

That's quite likely not what you want. Try Encoding.Default, which on the
.NET Framework maps to the system's ANSI code page.

> Since I am clueless on matters of encoding, my question is: am I safe
> using UTF-7 if I only care about English and Spanish? What is the
> downside? I won't be able to read Romanian? Japanese?

Depends on the input. UTF-7 is only (and rarely?) used for E-mail. I
guess the chance to find a true UTF-7 encoded file is pretty much zero.
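
To make the point concrete, here is a small sketch (in Python, not the thread's VB.NET) comparing how "ñ" is stored in real UTF-7 and in Latin-1. The sample word is just an illustration. Genuine UTF-7 uses only 7-bit ASCII bytes and escapes non-ASCII characters as "+...-" runs, so a file that holds "ñ" as a single byte above 0x7F was not written as UTF-7.

```python
text = "España"

# Real UTF-7: only 7-bit ASCII bytes; ñ is written as an escaped "+...-" run.
utf7_bytes = text.encode("utf-7")

# Latin-1 (ISO-8859-1): ñ is the single byte 0xF1.
latin1_bytes = text.encode("latin-1")

print(utf7_bytes)
print(latin1_bytes)
```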

> Is there a way to programmatically find the correct encoding without
> the BOM?

As I said, in general no. If the range of possible encodings is
limited, you may be able to create a proper detection algorithm, though.
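
One possible shape for such a detector, sketched in Python under the assumption that the input is either BOM-marked UTF-8/UTF-16, BOM-less UTF-8, or a Windows-1252 style single-byte file. The function name `guess_decode` and the `cp1252` fallback are assumptions for this example, not something prescribed in this thread. The idea relies on UTF-8 being strict: text in a single-byte encoding that contains accented characters almost never happens to be valid UTF-8.

```python
import codecs


def guess_decode(data: bytes, fallback: str = "cp1252"):
    """Decode via BOM if present, else strict UTF-8, else a fallback code page.

    Returns (text, encoding_name). This is a heuristic, not a guarantee.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM and picks the byte order itself.
        return data.decode("utf-16"), "utf-16"
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, hence errors="replace".
        return data.decode(fallback, errors="replace"), fallback
```

Pure-ASCII input decodes identically either way and is reported as UTF-8; the fallback only matters once a byte above 0x7F shows up.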

Cheers,
--
http://www.joergjooss.de
mailto:news-re***@joergjooss.de
Author
14 Dec 2005 6:41 PM
Chip
Author
14 Dec 2005 7:04 PM
Juan T. Llibre
If you only care about English and Spanish,
you'll be safe using ISO-8859-1.
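
A minimal sketch of what that looks like in practice, written in Python rather than VB.NET and assuming the file really was saved as ISO-8859-1. The file name and sample text are made up for the example. It also shows one plausible reason the character goes missing: a reader that assumes UTF-8 cannot interpret the lone 0xF1 byte.

```python
import os
import tempfile

# Write a small BOM-less file the way a Latin-1 editor would (sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"Ma\xf1ana en Espa\xf1a\n")

# Reading with the matching encoding recovers the text.
with open(path, encoding="iso-8859-1") as f:
    good = f.read()

# Reading as UTF-8 cannot decode the 0xF1 bytes; with errors="replace"
# each one becomes U+FFFD (the replacement character) instead of ñ.
with open(path, encoding="utf-8", errors="replace") as f:
    bad = f.read()

print(good)
print(bad)
```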



Juan T. Llibre
ASP.NET MVP
============
