May 24, 2011

How to export plain text to UTF-8

Here is how to determine a text file's encoding by looking at its first few bytes.
Suppose the file contains the text: Hello

48 65 6C 6C 6F

This is the traditional ANSI encoding: one byte per character and no BOM.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (UTF-16, little-endian) encoding with no BOM.

FF FE 48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with BOM. The BOM (FF FE) serves two purposes: first, it tags the file as a Unicode document; second, the order in which the two bytes appear indicates that the file is little-endian.

00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with no BOM. Notepad does not support this encoding.

FE FF 00 48 00 65 00 6C 00 6C 00 6F

This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM.

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

2B 2F 76 38 2D 48 65 6C 6C 6F

This is UTF-7 encoding. The first five bytes (2B 2F 76 38 2D) are the UTF-7 encoding of the BOM.
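The byte patterns above can be turned into a simple check. The following is only a minimal sketch: the class name BomSniffer and the returned labels are made up for illustration, and it recognizes only the BOMs listed in this post. A file without a BOM (ANSI, or Unicode without BOM) is reported as "no BOM", because the leading bytes alone cannot identify it.

```java
final class BomSniffer {

    /** Returns a label for the BOM found at the start of head, or "no BOM". */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0x2B, 0x2F, 0x76)) return "UTF-7";
        return "no BOM";
    }

    /** Compares the leading bytes as unsigned values (Java bytes are signed). */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // The UTF-8 sample from this post: EF BB BF 48 65 6C 6C 6F
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x48, 0x65, 0x6C, 0x6C, 0x6F};
        System.out.println(detect(sample)); // prints UTF-8
    }
}
```

Masking with 0xFF matters here: without it, a byte such as 0xEF would be compared as -17 and never match.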



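Here is a small test program. It prints the first three bytes of a file, then the BOM values to compare them against:
---------------------
import java.io.FileInputStream;
import java.io.IOException;

public class BomTest {
    public static void main(String[] args) throws IOException {
        // Placeholder values 1, 2, 3 remain if the file has fewer than three bytes.
        byte[] arr = new byte[]{1, 2, 3};
        // try-with-resources closes the stream once the bytes are read.
        try (FileInputStream fileStream = new FileInputStream("d:\\4.txt")) {
            fileStream.read(arr);
        }
        System.out.println(arr[0]);
        System.out.println(arr[1]);
        System.out.println(arr[2]);
        // Java bytes are signed, so 0xEF prints as -17, 0xBB as -69, 0xBF as -65.
        System.out.println("...................");
        System.out.println("utf-8: " + (byte) 0xEF + " - " + (byte) 0xBB + " - " + (byte) 0xBF); // EF BB BF
        System.out.println("big-endian: " + (byte) 0xFE + " - " + (byte) 0xFF); // FE FF
        System.out.println("little-endian: " + (byte) 0xFF + " - " + (byte) 0xFE); // FF FE
    }
}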
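For the export itself, one possible approach is to build the bytes by hand: the UTF-8 BOM followed by the text encoded as UTF-8. This is a rough sketch, not the only way to do it; the class Utf8Export and its methods are hypothetical names chosen for this example. Java's UTF-8 encoder does not add a BOM on its own, so the three bytes are prepended explicitly.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Export {

    /** Returns EF BB BF followed by the UTF-8 bytes of text. */
    static byte[] encodeWithBom(String text) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(body, 0, out, bom.length, body.length);
        return out;
    }

    /** Formats bytes as space-separated uppercase hex pairs, for comparison with a hex dump. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeWithBom("Hello"))); // prints EF BB BF 48 65 6C 6C 6F
    }
}
```

The resulting array can then be written to disk, for example with java.nio.file.Files.write(path, bytes). The output matches the UTF-8 byte sequence shown earlier in this post.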
