Exactly Once Processing in Kafka

1. Overview

In this tutorial, we'll look at how Kafka ensures exactly-once delivery between producer and consumer applications through the Transactional API, introduced in Kafka 0.11.

Additionally, we'll use this API to implement transactional producers and consumers to achieve end-to-end exactly-once delivery in a WordCount example.

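2. Message Delivery in Kafka

Due to various failures, messaging systems can't guarantee message delivery between producer and consumer applications. Depending on how the client applications interact with such systems, the following message semantics are possible:

If a messaging system will never duplicate a message but might miss the occasional message, we call that at-most-once.
Or, if it will never miss a message but might duplicate the occasional message, we call that at-least-once.
But, if it always delivers all messages without duplication, that is exactly-once.

Initially, Kafka only supported at-most-once and at-least-once message delivery.

However, the introduction of transactions between Kafka brokers and client applications ensures exactly-once delivery in Kafka. To understand it better, let's quickly review the transactional client API.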

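3. Maven Dependencies

To work with the transaction API, we'll need Kafka's Java client in our pom:

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.0.0</version>
    </dependency>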

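4. A Transactional consume-transform-produce Loop

For our example, we're going to consume messages from an input topic, sentences.

Then, for each sentence, we'll count every word and send the individual word counts to an output topic, counts.

In the example, we'll assume that there is already transactional data available in the sentences topic.

4.1. A Transaction-Aware Producer

So let's first add a typical Kafka producer: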

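Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
// Serializers are required; String keys and values match the word-count example
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");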

Additionally, we need to specify a transactional.id and enable idempotence:

producerProps.put("enable.idempotence", "true");
producerProps.put("transactional.id", "prod-1");

KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

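Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.

Simply put, if the producer accidentally sends the same message to Kafka more than once, these settings enable Kafka to notice the duplicate.

All that we need to do is make sure the transaction id is distinct for each producer, though consistent across restarts.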
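To make the deduplication idea concrete, here is a minimal sketch of a log that remembers the last sequence number it accepted from each producer and ignores repeats. It is a toy model under our own assumptions, not Kafka's broker code: the class DedupLogSketch and its methods are hypothetical, and the real broker tracks sequence numbers per partition using a broker-assigned producer id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical names): illustrates sequence-based deduplication,
// not Kafka's actual broker implementation.
class DedupLogSketch {
    private final Map<String, Integer> lastSequence = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    // Appends the message unless this producer already delivered this sequence number.
    boolean append(String producerId, int sequence, String message) {
        int last = lastSequence.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // duplicate caused by a retried send: ignore it
        }
        lastSequence.put(producerId, sequence);
        log.add(message);
        return true;
    }

    List<String> messages() {
        return log;
    }

    public static void main(String[] args) {
        DedupLogSketch broker = new DedupLogSketch();
        broker.append("prod-1", 0, "hello");
        broker.append("prod-1", 0, "hello"); // the same send, retried
        broker.append("prod-1", 1, "world");
        System.out.println(broker.messages()); // prints [hello, world]
    }
}
```

The retried send carries the same sequence number as the original, so the log keeps only one copy.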

4.2. Enabling the Producer for Transactions

Once we are ready, we also need to call initTransactions to prepare the producer to use transactions:

producer.initTransactions();

This registers the producer with the broker as one that can use transactions, identifying it by its transactional.id and a sequence number, or epoch. In turn, the broker will use these to write-ahead any actions to a transaction log.

Consequently, the broker will remove any actions from that log that belong to a producer with the same transaction id and an earlier epoch, presuming them to be from defunct transactions.
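The effect of the epoch can be sketched with a small model in which every registration of a transaction id bumps its epoch, and only the latest epoch is allowed to write. This is an illustration under our own assumptions: EpochFencingSketch, register, and accepts are hypothetical names, not part of the Kafka API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only (hypothetical names): illustrates fencing of older producer
// instances by epoch, not Kafka's actual transaction coordinator.
class EpochFencingSketch {
    private final Map<String, Integer> currentEpoch = new HashMap<>();

    // Models a producer registering itself: the first call yields epoch 0,
    // and every later call for the same id yields the previous epoch plus one.
    int register(String transactionalId) {
        return currentEpoch.merge(transactionalId, 0, (old, ignored) -> old + 1);
    }

    // A write is accepted only when it comes from the latest epoch of that id.
    boolean accepts(String transactionalId, int epoch) {
        Integer latest = currentEpoch.get(transactionalId);
        return latest != null && latest == epoch;
    }

    public static void main(String[] args) {
        EpochFencingSketch coordinator = new EpochFencingSketch();
        int oldInstance = coordinator.register("prod-1"); // epoch 0
        int newInstance = coordinator.register("prod-1"); // epoch 1, e.g. after a restart
        System.out.println(coordinator.accepts("prod-1", oldInstance)); // prints false
        System.out.println(coordinator.accepts("prod-1", newInstance)); // prints true
    }
}
```

This is why the transaction id must stay the same across restarts: the restarted instance takes over the id, and anything left behind by the earlier instance is treated as defunct.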

4.3. A Transaction-Aware Consumer

When we consume, we can read all the messages on a topic partition in order. However, we can indicate with isolation.level that we should wait to read transactional messages until the associated transaction has been committed:

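Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "my-group-id");
consumerProps.put("enable.auto.commit", "false");
consumerProps.put("isolation.level", "read_committed");
// Deserializers are required; String keys and values match the word-count example
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(singleton("sentences"));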

Using a value of read_committed ensures that we don't read any transactional messages before the transaction completes.

The default value of isolation.level is read_uncommitted.
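The difference between the two isolation levels can be sketched with a simplified log in which every message belongs to a transaction. In this model, a read_committed reader stops at the first message of a transaction that is still open, while a read_uncommitted reader sees everything. The class IsolationLevelSketch and its members are hypothetical, and the model ignores aborted transactions and non-transactional messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only (hypothetical names): a simplified view of what each
// isolation level lets a consumer see. Aborted transactions are not modeled.
class IsolationLevelSketch {
    static final class Entry {
        final String txId;
        final String value;

        Entry(String txId, String value) {
            this.txId = txId;
            this.value = value;
        }
    }

    // Returns the values visible to a reader, in log order.
    static List<String> read(List<Entry> log, Set<String> committedTx, boolean readCommitted) {
        List<String> seen = new ArrayList<>();
        for (Entry entry : log) {
            if (readCommitted && !committedTx.contains(entry.txId)) {
                break; // first message of a still-open transaction: wait here
            }
            seen.add(entry.value);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
            new Entry("tx1", "a"), new Entry("tx1", "b"), new Entry("tx2", "c"));
        Set<String> committed = new HashSet<>(Arrays.asList("tx1"));
        System.out.println(read(log, committed, true));  // prints [a, b]
        System.out.println(read(log, committed, false)); // prints [a, b, c]
    }
}
```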

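4.4. Consuming and Transforming by Transaction

Now that we have the producer and consumer both configured to write and read transactionally, we can consume records from our input topic and count each word in each record: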

ConsumerRecords<String, String> records = consumer.poll(ofSeconds(60));
Map<String, Integer> wordCountMap =
  records.records(new TopicPartition("sentences", 0))
    .stream()
    .flatMap(record -> Stream.of(record.value().split(" ")))
    // key: the word, value: 1 per occurrence, merged by summing
    .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
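The counting step itself doesn't depend on Kafka, so it can be tried in isolation. The sketch below applies the same stream pipeline to plain strings; the class WordCountSketch and the sample sentences are our own illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Standalone illustration (hypothetical class name) of the word-count transform.
class WordCountSketch {
    // Splits every sentence on single spaces and sums one count per occurrence.
    static Map<String, Integer> countWords(List<String> sentences) {
        return sentences.stream()
            .flatMap(sentence -> Stream.of(sentence.split(" ")))
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords(Arrays.asList("to be or not to be", "be quick"));
        System.out.println(counts.get("be")); // prints 3
        System.out.println(counts.get("to")); // prints 2
    }
}
```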

Note that there is nothing transactional about the above code. However, since we used read_committed, no messages that were written to the input topic in the same transaction will be read by this consumer until that transaction has been committed.

Now, we can send the calculated word count to the output topic.

Let's see how we can produce our results, also transactionally.

4.5. Send API

To send our counts as new messages, but in the same transaction, we call beginTransaction:

producer.beginTransaction();

Next, we can write each one to our "counts" topic with the key being the word and the count being the value:

wordCountMap.forEach((key, value) ->
    producer.send(new ProducerRecord<>("counts", key, value.toString())));

Note that because the producer can partition the data by the key, transactional messages can span multiple partitions, each being read by separate consumers. Therefore, the Kafka broker will store a list of all updated partitions for a transaction.

Note also that, within a transaction, a producer can use multiple threads to send records in parallel.

4.6. Committing Offsets

Finally, we need to commit the offsets that we just finished consuming. With transactions, we still commit the offsets for the input topic we read from, as usual, but we send them through the producer's transaction rather than through the consumer.

We can do all of this in a single call, but we first need to calculate the offsets for each topic partition:

Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = new HashMap<>();
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionedRecords = records.records(partition);
    // offset of the last record we consumed from this partition
    long offset = partitionedRecords.get(partitionedRecords.size() - 1).offset();
    offsetsToCommit.put(partition, new OffsetAndMetadata(offset + 1));
}

Note that what we commit to the transaction is the upcoming offset, meaning that we need to add 1.
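The "last offset plus one" rule can be checked without a broker. The sketch below uses plain strings as partition names and lists of offsets in place of consumed records; NextOffsetSketch and its method are hypothetical and only mirror the arithmetic of the loop above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration (hypothetical names) of computing the offsets to commit.
class NextOffsetSketch {
    // For each partition, returns the offset of the next record to read:
    // the last consumed offset plus one. Partitions with no records are skipped.
    static Map<String, Long> offsetsToCommit(Map<String, List<Long>> consumedOffsets) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, List<Long>> entry : consumedOffsets.entrySet()) {
            List<Long> offsets = entry.getValue();
            if (offsets.isEmpty()) {
                continue;
            }
            long last = offsets.get(offsets.size() - 1);
            result.put(entry.getKey(), last + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> consumed = new HashMap<>();
        consumed.put("sentences-0", Arrays.asList(5L, 6L, 7L));
        consumed.put("sentences-1", Arrays.asList(12L));
        Map<String, Long> toCommit = offsetsToCommit(consumed);
        System.out.println(toCommit.get("sentences-0")); // prints 8
        System.out.println(toCommit.get("sentences-1")); // prints 13
    }
}
```

Committing 8 for a partition whose last consumed offset was 7 means that, after a restart, consumption resumes with the first record we have not yet processed.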

Then we can send our calculated offsets to the transaction:

producer.sendOffsetsToTransaction(offsetsToCommit, "my-group-id");

4.7. Committing or Aborting the Transaction

Finally, we can commit the transaction, which will atomically write the offsets to the __consumer_offsets topic as well as to the transaction itself:

producer.commitTransaction();

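This flushes any buffered messages to their respective partitions. In addition, the Kafka broker makes all messages in that transaction available to the consumers.

Of course, if anything goes wrong while we are processing, for example, if we catch an exception, we can call abortTransaction: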

try {
  // ... read from input topic
  // ... transform
  // ... write to output topic
  producer.commitTransaction();
} catch ( Exception e ) {
  producer.abortTransaction();
}

This drops any buffered messages and removes the transaction from the broker.
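The all-or-nothing behavior of commit and abort can be sketched with a small in-memory buffer. This is only a model under our own assumptions: TransactionBufferSketch is a hypothetical class, and it keeps pending messages in memory, which is a simplification of how Kafka handles uncommitted data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical names): shows that messages become visible
// together on commit and are discarded together on abort.
class TransactionBufferSketch {
    private final List<String> committed = new ArrayList<>();
    private List<String> pending = null;

    void beginTransaction() {
        if (pending != null) {
            throw new IllegalStateException("a transaction is already open");
        }
        pending = new ArrayList<>();
    }

    void send(String message) {
        requireOpen();
        pending.add(message);
    }

    // Makes every pending message visible at once.
    void commitTransaction() {
        requireOpen();
        committed.addAll(pending);
        pending = null;
    }

    // Discards every pending message.
    void abortTransaction() {
        requireOpen();
        pending = null;
    }

    List<String> visibleMessages() {
        return committed;
    }

    private void requireOpen() {
        if (pending == null) {
            throw new IllegalStateException("no open transaction");
        }
    }

    public static void main(String[] args) {
        TransactionBufferSketch producer = new TransactionBufferSketch();
        producer.beginTransaction();
        producer.send("kafka=1");
        producer.commitTransaction();

        producer.beginTransaction();
        producer.send("oops=1");
        producer.abortTransaction();

        System.out.println(producer.visibleMessages()); // prints [kafka=1]
    }
}
```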

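If we neither commit nor abort before the transaction times out, the Kafka broker will abort the transaction itself. The producer sets this timeout with transaction.timeout.ms, and the broker caps it at transaction.max.timeout.ms, whose default value is 900,000 milliseconds, or 15 minutes.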

5. Other consume-transform-produce Loops

What we've just seen is a basic consume-transform-produce loop which reads and writes to the same Kafka cluster.

Conversely, applications that must read and write to different Kafka clusters must use the older commitSync and commitAsync API. Typically, such applications store consumer offsets in their own external state storage to maintain transactionality.

6. Conclusion

For data-critical applications, end-to-end exactly-once processing is often imperative.

In this tutorial, we saw how we use Kafka to do exactly this, using transactions, and we implemented a transaction-based word counting example to illustrate the principle.

Feel free to check out all the code samples on GitHub.
