Convert String to Byte Array and Reverse in Java


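1. Introduction

We often need to convert between a String and a byte array in Java. In this tutorial, we'll examine these operations in detail.

First, we'll look at various ways to convert a String to a byte array. Then we'll look at the same operations in reverse.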

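2. Converting a String to a Byte Array

Conceptually, a String in Java is a sequence of Unicode characters (UTF-16 code units). To convert it to a byte array, we translate the sequence of characters into a sequence of bytes. For this translation, we use an instance of Charset. This class specifies a mapping between a sequence of chars and a sequence of bytes.

We refer to the above process as encoding.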
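
To make the role of the Charset concrete, here is a minimal sketch that encodes the same one-character string with three of the standard charsets. The class name CharsetMappingDemo and its encode helper are illustrative names, not part of the original examples.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: the same text maps to different bytes per charset.
class CharsetMappingDemo {

    // Encodes the text with the given charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // prints [-23]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.ISO_8859_1)));
        // prints [-61, -87]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8)));
        // prints [0, -23]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_16BE)));
    }
}
```

The character is identical in all three calls; only the charset, and therefore the resulting byte sequence, changes.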

In Java, we can encode a String into a byte array in multiple ways. Let's look at each of them in detail with examples.

2.1. Using String.getBytes()

The String class provides three overloaded getBytes methods to encode a String into a byte array:

getBytes() – encodes using the platform's default charset
getBytes(String charsetName) – encodes using the named charset
getBytes(Charset charset) – encodes using the provided charset

First, let's encode a string using the platform's default charset:

String inputString = "Hello World!";
byte[] byteArray = inputString.getBytes();

The above method is platform-dependent, since it uses the platform's default charset. We can get this charset by calling Charset.defaultCharset().
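
As a small sketch of that dependency, the following snippet prints the default charset and checks that the no-argument getBytes() produces the same bytes as passing Charset.defaultCharset() explicitly. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class: getBytes() is shorthand for "use the default charset".
class DefaultCharsetDemo {

    // Returns true when the implicit and explicit default-charset encodings match.
    static boolean sameAsDefault(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // The printed name depends on the JVM version and its configuration.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(sameAsDefault("Hello World!"));
    }
}
```

Because the printed charset can differ between machines, code that must produce the same bytes everywhere should name its charset explicitly.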

Second, let's encode a string using a named charset:

@Test
public void whenGetBytesWithNamedCharset_thenOK()
  throws UnsupportedEncodingException {
    String inputString = "Hello World!";
    // IBM01140 is an EBCDIC code page, so the bytes differ from ASCII
    String charsetName = "IBM01140";

    byte[] byteArray = inputString.getBytes(charsetName);

    assertArrayEquals(
      new byte[] { -56, -123, -109, -109, -106, 64, -26,
        -106, -103, -109, -124, 90 },
      byteArray);
}

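This method throws an UnsupportedEncodingException if the named charset is not supported.

The behavior of the above two versions is unspecified if the input contains characters that the charset cannot encode. In contrast, the third version uses the charset's default replacement byte array to encode unsupported input.

Next, let's call the third version of the getBytes() method and pass an instance of Charset: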

@Test
public void whenGetBytesWithCharset_thenOK() {
    String inputString = "Hello ਸੰਸਾਰ!";
    Charset charset = Charset.forName("ASCII");

    byte[] byteArray = inputString.getBytes(charset);

    // Each of the five unmappable characters is replaced with '?' (63)
    assertArrayEquals(
      new byte[] { 72, 101, 108, 108, 111, 32, 63, 63, 63,
        63, 63, 33 },
      byteArray);
}

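Here we're using the factory method Charset.forName to get an instance of Charset. This method throws a runtime exception (IllegalCharsetNameException) if the name of the requested charset is invalid. It also throws a runtime exception (UnsupportedCharsetException) if the charset is not supported by the current JVM.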
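
The two failure modes of Charset.forName can be told apart as in the following sketch. The class name, the helper, and the sample charset names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical demo class: distinguishes the two runtime exceptions of forName.
class CharsetLookupDemo {

    // Describes what Charset.forName did for the given name.
    static String lookup(String name) {
        try {
            return "found: " + Charset.forName(name).name();
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names
            return "illegal name";
        } catch (UnsupportedCharsetException e) {
            // The name is well-formed, but this JVM has no such charset
            return "unsupported";
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("ASCII"));                // found: US-ASCII
        System.out.println(lookup("bad name!"));            // illegal name
        System.out.println(lookup("X-NOT-A-REAL-CHARSET")); // unsupported
    }
}
```

When the name comes from configuration or user input, Charset.isSupported(name) is an alternative way to check availability before the lookup, although it still throws for illegal names.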

However, some charsets are guaranteed to be available on every Java platform. The StandardCharsets class defines constants for these charsets.

Finally, let's encode using one of the standard charsets:

@Test
public void whenGetBytesWithStandardCharset_thenOK() {
    String inputString = "Hello World!";
    Charset charset = StandardCharsets.UTF_16;

    byte[] byteArray = inputString.getBytes(charset);

    // The UTF_16 encoder writes a big-endian byte-order mark (-2, -1) first
    assertArrayEquals(
      new byte[] { -2, -1, 0, 72, 0, 101, 0, 108, 0, 108, 0,
        111, 0, 32, 0, 87, 0, 111, 0, 114, 0, 108, 0, 100, 0, 33 },
      byteArray);
}

This completes our review of the various getBytes versions. Next, let's look at the method provided by Charset itself.

2.2. Using Charset.encode()

The Charset class provides encode(), a convenient method that encodes Unicode characters into bytes. This method always replaces invalid input and unmappable characters using the charset's default replacement byte array.

Let's use the encode method to convert a String into a byte array:

@Test
public void whenEncodeWithCharset_thenOK() {
    String inputString = "Hello ਸੰਸਾਰ!";
    Charset charset = StandardCharsets.US_ASCII;

    // Note: array() returns the buffer's entire backing array
    byte[] byteArray = charset.encode(inputString).array();

    assertArrayEquals(
      new byte[] { 72, 101, 108, 108, 111, 32, 63, 63, 63, 63, 63, 33 },
      byteArray);
}

As we can see above, the unsupported characters have been replaced with the charset's default replacement byte, 63.
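
One caveat about reading the result: ByteBuffer.array() exposes the buffer's whole backing array, and the API does not promise that this array is exactly as long as the encoded content. A more defensive sketch copies only the bytes between the buffer's position and limit. The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: copies exactly the encoded bytes out of the buffer.
class ByteBufferCopyDemo {

    // Encodes the text and returns an array holding only the encoded bytes.
    static byte[] encodeExact(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        byte[] bytes = new byte[buffer.remaining()]; // remaining() = limit - position
        buffer.get(bytes);                           // copy just the encoded content
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = encodeExact("h\u00E9llo", StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes)); // [104, -61, -87, 108, 108, 111]
    }
}
```

This costs one extra copy, but the result no longer depends on how large a buffer the encoder happened to allocate.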

The approaches used so far rely on the CharsetEncoder class internally to perform the encoding. Let's examine this class in the next section.

2.3. CharsetEncoder

CharsetEncoder transforms Unicode characters into a sequence of bytes for a given charset. Moreover, it provides fine-grained control over the encoding process.

Let's use this class to convert a String into a byte array:

@Test
public void whenUsingCharsetEncoder_thenOK()
  throws CharacterCodingException {
    String inputString = "Hello ਸੰਸਾਰ!";
    CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
    encoder.onMalformedInput(CodingErrorAction.IGNORE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .replaceWith(new byte[] { 0 });

    byte[] byteArray = encoder.encode(CharBuffer.wrap(inputString))
                         .array();

    // Unmappable characters are replaced with the custom byte 0
    assertArrayEquals(
      new byte[] { 72, 101, 108, 108, 111, 32, 0, 0, 0, 0, 0, 33 },
      byteArray);
}

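Here we're creating an instance of CharsetEncoder by calling the newEncoder method on a Charset object.

Then we're specifying the actions for error conditions by calling the onMalformedInput() and onUnmappableCharacter() methods. We can specify the following actions:

IGNORE – drops the erroneous input
REPLACE – replaces the erroneous input
REPORT – reports the error by returning a CoderResult object or throwing a CharacterCodingException

Furthermore, we're using the replaceWith() method to specify the replacement byte array.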
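
For contrast with IGNORE and REPLACE, here is a minimal sketch of the REPORT action on the encoding side: the convenience method encode(CharBuffer) throws a CharacterCodingException as soon as it meets a character that US-ASCII cannot represent. The class and method names are invented for this example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: strict encoding that fails instead of substituting.
class EncoderReportDemo {

    // Returns true only if every character of the text is encodable in US-ASCII.
    static boolean canEncodeStrictly(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed or unmappable input when the action is REPORT
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canEncodeStrictly("Hello World!"));  // true
        System.out.println(canEncodeStrictly("Hello \u0A38!")); // false
    }
}
```

Failing fast like this suits cases where silently substituted bytes would corrupt data further down the line.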

This completes our review of the various approaches to converting a String to a byte array. Next, let's look at the reverse operation.

3. Converting a Byte Array to a String

We refer to the process of converting a byte array to a String as decoding. Similar to encoding, this process requires a Charset.

However, we can't just use any charset to decode a byte array. We should use the charset that was used to encode the String into the byte array.
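
The following sketch shows what goes wrong when the charsets don't match. It encodes a string as UTF-8 and then decodes the bytes with ISO-8859-1, using the String constructor that takes a byte array and a Charset. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decoding with the wrong charset garbles the text.
class CharsetMismatchDemo {

    // Encodes with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "h\u00E9llo";

        // Matching charsets restore the original text
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(original.equals(same)); // true

        // Mismatched charsets turn the two UTF-8 bytes of U+00E9 into two characters
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 6
    }
}
```

No exception is raised in the mismatched case, which is why this kind of corruption often goes unnoticed until the text is displayed.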

We can convert a byte array to a String in many ways. Let's examine each of them in detail.

3.1. Using the String Constructor

The String class has a few constructors that take a byte array as input. They're all similar to the getBytes methods, but work in reverse.

First, let's convert a byte array to a String using the platform's default charset:

@Test
public void whenStringConstructorWithDefaultCharset_thenOK() {
    byte[] byteArray = { 72, 101, 108, 108, 111, 32, 87, 111, 114,
      108, 100, 33 };

    String string = new String(byteArray);

    assertNotNull(string);
}

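Note that we don't assert anything here about the contents of the decoded string. This is because it may decode to something different, depending on the platform's default charset.

For this reason, we should generally avoid this method.

Second, let's use a named charset for decoding: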

@Test
public void whenStringConstructorWithNamedCharset_thenOK()
  throws UnsupportedEncodingException {
    String charsetName = "IBM01140";
    byte[] byteArray = { -56, -123, -109, -109, -106, 64, -26, -106,
      -103, -109, -124, 90 };

    String string = new String(byteArray, charsetName);

    assertEquals("Hello World!", string);
}

This method throws an UnsupportedEncodingException if the named charset is not available on the JVM.

Third, let's use a Charset object to do the decoding:

@Test
public void whenStringConstructorWithCharSet_thenOK() {
    Charset charset = Charset.forName("UTF-8");
    byte[] byteArray = { 72, 101, 108, 108, 111, 32, 87, 111, 114,
      108, 100, 33 };

    String string = new String(byteArray, charset);

    assertEquals("Hello World!", string);
}

Finally, let's use a standard Charset for the same:

@Test
public void whenStringConstructorWithStandardCharSet_thenOK() {
    Charset charset = StandardCharsets.UTF_16;

    // The leading -2, -1 is the big-endian byte-order mark
    byte[] byteArray = { -2, -1, 0, 72, 0, 101, 0, 108, 0, 108, 0,
      111, 0, 32, 0, 87, 0, 111, 0, 114, 0, 108, 0, 100, 0, 33 };

    String string = new String(byteArray, charset);

    assertEquals("Hello World!", string);
}

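So far, we've converted a byte array into a String using the constructor. Now let's look at the other approaches.

3.2. Using Charset.decode()

The Charset class provides the decode() method, which converts a ByteBuffer to a CharBuffer that we can then turn into a String with toString():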

@Test
public void whenDecodeWithCharset_thenOK() {
    // -10 and -63 are not valid US-ASCII bytes
    byte[] byteArray = { 72, 101, 108, 108, 111, 32, -10, 111,
      114, 108, -63, 33 };
    Charset charset = StandardCharsets.US_ASCII;
    String string = charset.decode(ByteBuffer.wrap(byteArray))
                      .toString();

    assertEquals("Hello �orl�!", string);
}

Here, the invalid input is replaced with the default replacement character for the charset (U+FFFD).

3.3. CharsetDecoder

All of the previous approaches for decoding use the CharsetDecoder class internally. We can use this class directly for fine-grained control over the decoding process:

@Test
public void whenUsingCharsetDecoder_thenOK()
  throws CharacterCodingException {
    byte[] byteArray = { 72, 101, 108, 108, 111, 32, -10, 111, 114,
      108, -63, 33 };
    CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder();

    decoder.onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .replaceWith("?");

    String string = decoder.decode(ByteBuffer.wrap(byteArray))
                      .toString();

    assertEquals("Hello ?orl?!", string);
}

Here we're replacing invalid input and unsupported characters with "?".

If we want to be informed in case of invalid input, we can change the decoder as follows:

decoder.onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT);
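
With REPORT configured, the convenience method decode(ByteBuffer) signals invalid input by throwing a CharacterCodingException. The sketch below wraps that in a validity check; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: strict decoding that reports invalid bytes.
class DecoderReportDemo {

    // Decodes the bytes as US-ASCII, throwing if any byte is invalid.
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Returns true only if the bytes decode cleanly as US-ASCII.
    static boolean isValidAscii(byte[] bytes) {
        try {
            decodeStrictly(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidAscii(new byte[] { 72, 101, 108, 108, 111 })); // true
        System.out.println(isValidAscii(new byte[] { 72, -10, 108 }));           // false
    }
}
```

Catching the exception at the boundary where bytes enter the application keeps corrupted text from spreading into the rest of the program.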

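4. Conclusion

In this article, we investigated multiple ways to convert a String to a byte array and vice versa. We should choose the appropriate method based on the input data and on the level of control required for invalid input.

As usual, the full source code can be found over on GitHub.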
