.net
newsgroups
File Read Spanish characters

There is surprisingly little information on the various encoding options for reading a text file. I have what seems to be a very basic issue: I'm reading a text file that includes Spanish characters such as "ñ". When I read the file into a string, that character is missing. Encoding seems to be the culprit. File writers SHOULD begin a file with the BOM (Byte Order Mark) to let us know what encoding to read the file with, but most software doesn't do this so we are left with BOMless files. So how can we reliably read these files without knowing what encoding it was written with?

Through trial and error I have found that using UTF-7 picks up these Spanish characters, along with the English:

Dim Reader As New StreamReader(fs, System.Text.Encoding.UTF7)

Since I am clueless on matters of encoding, my question is: am I safe using UTF-7 if I only care about English and Spanish? What is the downside? I won't be able to read Romanian? Japanese?

Is there a way to programmatically find the correct encoding without the BOM?

Chip

Chip wrote:
> There is surprisingly little information on the various encoding
> options for reading a text file. I have what seems to be a very basic
> issue: I'm reading a text file that includes Spanish characters such
> as "ñ". When I read the file into a string, that character is
> missing. Encoding seems to be the culprit. File writers SHOULD begin
> a file with the BOM (Byte Order Mark) to let us know what encoding to
> read the file with, but most software doesn't do this so we are left
> with BOMless files.

Remember that these are byte order marks, which are intended to be used for identifying whether an encoding uses Big Endian or Little Endian representation. The fact that some encodings can be identified by their BOM is just a nice side effect.

> So how can we reliably read these files without
> knowing what encoding it was written with?

Only through application-specific metadata (like HTTP headers). There's no grand universal scheme to tell a file's character encoding.

> Through trial and error I have found that using UTF-7 picks up these
> Spanish characters, along with the English. Dim Reader As New
> StreamReader(fs, System.Text.Encoding.UTF7).

That's quite likely not what you want. Try Encoding.Default.

> Since I am clueless on matters of encoding, my question is: am I safe
> using UTF-7 if I only care about English and Spanish? What is the
> downside? I won't be able to read Romanian? Japanese?

Depends on the input. UTF-7 is only (and rarely?) used for e-mail. I guess the chance of finding a true UTF-7 encoded file is pretty much zero.

> Is there a way to programmatically find the correct encoding without
> the BOM?

As I said, in general no. If the range of possible encodings is limited, you may be able to create a proper detection algorithm, though.

Cheers,

If you only care about English and Spanish,
you'll be safe using iso-8859-1.

Juan T. Llibre
ASP.NET MVP
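
As a rough illustration of the "limited range of possible encodings" idea raised in the thread, the sketch below assumes the file can only be UTF-8 or ISO-8859-1. The module and function names (EncodingSketch, ReadTextGuessingEncoding) are hypothetical and not from any poster's code. It looks for a UTF-8 BOM, then attempts a strict UTF-8 decode, and falls back to ISO-8859-1 when the bytes are not valid UTF-8. It is a sketch under those assumptions, not a general-purpose detector.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingSketch

    ' Hypothetical helper: decodes a file ASSUMED to be either UTF-8 or ISO-8859-1.
    Function ReadTextGuessingEncoding(fileName As String) As String
        Dim bytes As Byte() = File.ReadAllBytes(fileName)

        ' 1. A UTF-8 BOM (EF BB BF), when present, is taken as decisive.
        If bytes.Length >= 3 AndAlso bytes(0) = &HEF _
           AndAlso bytes(1) = &HBB AndAlso bytes(2) = &HBF Then
            Return Encoding.UTF8.GetString(bytes, 3, bytes.Length - 3)
        End If

        ' 2. Strict UTF-8: the second argument (throwOnInvalidBytes) makes the
        '    decoder throw instead of silently substituting replacement characters.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            Return strictUtf8.GetString(bytes)
        Catch ex As DecoderFallbackException
            ' 3. Not valid UTF-8, so fall back to ISO-8859-1 (an assumption).
            Return Encoding.GetEncoding("iso-8859-1").GetString(bytes)
        End Try
    End Function

    Sub Main()
        ' "España" encoded as ISO-8859-1: the ñ is the single byte F1.
        Dim latin1Bytes As Byte() = {&H45, &H73, &H70, &H61, &HF1, &H61}
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllBytes(tempFile, latin1Bytes)

        Dim text As String = ReadTextGuessingEncoding(tempFile)
        Dim expected As String = "Espa" & ChrW(&HF1) & "a"
        Console.WriteLine(text = expected)

        File.Delete(tempFile)
    End Sub

End Module
```

Every byte sequence is valid ISO-8859-1, so the fallback never fails. That means this approach cannot tell ISO-8859-1 apart from other single-byte code pages such as Windows-1252; it only helps under the two-encoding assumption stated above. When the encoding is already known, passing it straight to the reader, for example New StreamReader(fs, Encoding.GetEncoding("iso-8859-1")), is simpler.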