Websocket text encoding

benoit · May 11, 2021, 7:44am

Hello,

When sending a text message with the sendText method of the class “WebSocket.class”, the frame is built by the buildTextFrame method of ClientFrameBuilder. That method copies the bytes of the String into an array by calling getBytes(), which uses the default encoding, ISO-8859-1.
However, RFC 6455, section 5.6, mandates UTF-8 as the encoding for text messages.
As a result, when sending a string that contains a char with a value above 0x7F, the server closes the websocket.
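
To illustrate the difference, here is a minimal sketch written for standard Java SE (the class name and the toHex helper are hypothetical, and StandardCharsets may not be available on every embedded platform). It encodes the same string with both charsets and prints the resulting bytes:

```java
import java.nio.charset.StandardCharsets;

public class TextFrameEncodingDemo {

    /** Hypothetical helper: formats a byte array as space-separated hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "modèle", with 'è' written as an escape to avoid source-file encoding issues
        String text = "mod\u00e8le";

        // ISO-8859-1: 'è' becomes the single byte 0xe8, which is not valid UTF-8 here
        System.out.println(toHex(text.getBytes(StandardCharsets.ISO_8859_1)));
        // prints "6d 6f 64 e8 6c 65"

        // UTF-8: 'è' becomes the two bytes 0xc3 0xa8, as a text frame requires
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));
        // prints "6d 6f 64 c3 a8 6c 65"
    }
}
```

The lone 0xe8 byte in the first output is a UTF-8 lead byte that is not followed by continuation bytes, so a server that validates text frames rejects the payload.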

I don’t know if my understanding is correct, so please clarify.

Best regards

Benoit

alexis.pineau · May 11, 2021, 3:40pm

Hello Benoit,

Your understanding is correct: our implementation of the Websocket library does not comply with RFC 6455 on this point. We will fix this problem in the next Websocket version (planned for the coming weeks).

Regards

Alexis

benoit · May 18, 2021, 9:33am

Hello Alexis,

I found that the SNI.toJavaString method takes the bytes as they are, regardless of their encoding, and makes a string with one char per byte. So, in the end, when the websocket sendText uses getBytes() (with the default platform encoding), the original bytes are restored in the returned array.
Likewise, when the websocket receives a UTF-8 encoded text message from the server, the bytes are taken one by one to make a string.
So the strings are stored in a special way in the String object, but stored that way, there are no encoding issues.
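
The round trip described above can be sketched as follows. This is only an illustration on standard Java SE: bytesToString is a hypothetical stand-in for the one-char-per-byte behaviour observed with SNI.toJavaString, not its actual implementation, and ISO-8859-1 is named explicitly instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteTransparentStringDemo {

    /** Hypothetical stand-in for the observed conversion: one char per input byte. */
    static String bytesToString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Inverse conversion: one byte per char, as getBytes() does when the default is ISO-8859-1. */
    static byte[] stringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "modèle": 7 bytes, because 'è' takes two
        byte[] utf8 = "mod\u00e8le".getBytes(StandardCharsets.UTF_8);

        // The String holds 7 chars, one per byte, not the 6 visible characters
        String carrier = bytesToString(utf8);
        System.out.println(carrier.length()); // prints 7

        // Converting back restores the original UTF-8 bytes unchanged
        System.out.println(Arrays.equals(utf8, stringToBytes(carrier))); // prints true
    }
}
```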

The only case where it can be an issue is when the string is made from a literal text.

...
String test2 = new String("modèle"); // Incorrect: 'è' is held as the single char 0xe8
String test3 = toJavaString("modèle");  // Correct: 'è' is held as the two chars 0xc3, 0xa8

//                'm',  'o',  'd',  'è' (two bytes),          'l',  'e'
byte[] btest4 = { 0x6d, 0x6f, 0x64, (byte) 0xc3, (byte) 0xa8, 0x6c, 0x65 };
String test4 = toJavaString(btest4);
...

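/* Similar to SNI.toJavaString(byte[] cString), but does not require a terminating NULL byte. */
String toJavaString(byte[] b) {
	// Decoded with the default platform encoding (ISO-8859-1 here): one char per byte
	return new String(b);
}

/* Converts a regular string so that it holds its UTF-8 bytes, one char per byte. */
String toJavaString(String s) throws java.io.UnsupportedEncodingException {
	byte[] b = s.getBytes("UTF-8");
	return toJavaString(b);
}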

If we call getBytes() on test2, we do not get a UTF-8 array: { 0x6d, 0x6f, 0x64, 0xe8, 0x6c, 0x65 }
If we call getBytes() on test3, we do get a UTF-8 array: { 0x6d, 0x6f, 0x64, 0xc3, 0xa8, 0x6c, 0x65 }

I just wanted to make things clearer. Please do not simply add the “UTF-8” argument to the getBytes() call when building the message to send to the websocket: strings that already hold UTF-8 bytes, one char per byte, would then be encoded a second time.
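
A minimal sketch of that concern, again on standard Java SE with hypothetical names (carrierOf and toHex are illustration helpers, not part of the Websocket library): a string that already carries UTF-8 bytes comes back intact with ISO-8859-1, but gets mangled when UTF-8 is applied on top.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    /** Hypothetical helper: formats a byte array as space-separated hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Hypothetical: builds a string that holds the UTF-8 bytes of text, one char per byte. */
    static String carrierOf(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String carrier = carrierOf("mod\u00e8le");

        // One byte per char: the UTF-8 bytes come back unchanged
        System.out.println(toHex(carrier.getBytes(StandardCharsets.ISO_8859_1)));
        // prints "6d 6f 64 c3 a8 6c 65"

        // UTF-8 applied on top: 0xc3 and 0xa8 are each encoded again as two bytes
        System.out.println(toHex(carrier.getBytes(StandardCharsets.UTF_8)));
        // prints "6d 6f 64 c3 83 c2 a8 6c 65"
    }
}
```

The second output is valid UTF-8, so the server would accept it, but it decodes to "modÃ¨le" rather than "modèle".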

Best regards

Benoit

alexis.pineau · May 21, 2021, 2:56pm

Hello Benoit,
Thank you for this feedback, we will take that into account for the Websocket fix.
Regards,
Alexis