fix: race condition that caused empty chat responses

Tomas Vik requested to merge tv/2024-07/fix-chat-race-condition into main

Description

This is really not ideal. I wish I had more time. And I wish there wasn't rate limiting on the Chat API; then I could be much more certain when trying alternative fixes.

The problem

Issue: #1397 (closed)

We receive two types of messages through the GraphQL subscription:

  • chunks - these are small packets of text that, when added together, construct the stream of the response
  • full message - this is one large message that contains the full response
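
To make the relationship concrete, here is a hypothetical illustration (the types and values are made up for this example; they are not the extension's real ones): the chunks, concatenated in arrival order, should add up to the same text the full message carries.

// Hypothetical types/values for illustration only.
type ChatMessage = { content: string };

// Chunks arrive one by one and are appended to the streamed response.
const chunks: ChatMessage[] = [{ content: 'Hel' }, { content: 'lo!' }];

// The full message repeats the complete response in a single payload.
const fullMessage: ChatMessage = { content: 'Hello!' };

const streamed = chunks.map(c => c.content).join('');
console.log(streamed === fullMessage.content); // true, if ordering behaves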

The code is designed as if the full message always comes last, but that's not what's happening.

Sometimes we receive a chunk after the full message.

And that's the issue that this MR patches over (not fixes, mind you).

The bandage 🌡

We remember when we've received the full message and ignore all subsequent chunks. This works, but it's a fix for the symptom, not the root cause.
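
For reference, this is roughly what the guard looks like (a sketch reconstructed from the diffs in this MR, not a verbatim copy; `channel`, `cable`, `messageCallback`, and `subscriptionId` come from the surrounding code in gitlab_chat_api.ts):

let fullMessageReceived = false;

channel.on('newChunk', async msg => {
  // Drop any chunk that straggles in after the full message.
  if (fullMessageReceived) {
    return;
  }
  await messageCallback(msg);
});

channel.on('fullMessage', async message => {
  fullMessageReceived = true;
  if (subscriptionId) {
    cable.disconnect();
  }
  await messageCallback(message);
});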

Better fix

A better fix would be to disconnect from the channel before doing any async work. That would also fix the issue, but it's an assumption I wasn't able to verify because of the nature of this bug, so I made a more explicit fix.

diff --git a/src/common/chat/gitlab_chat_api.ts b/src/common/chat/gitlab_chat_api.ts
index b829c23c..c89799c7 100644
--- a/src/common/chat/gitlab_chat_api.ts
+++ b/src/common/chat/gitlab_chat_api.ts
@@ -177,11 +177,10 @@ export class GitLabChatApi {
 
     channel.on('newChunk', messageCallback);
     channel.on('fullMessage', async message => {
-      await messageCallback(message);
-
       if (subscriptionId) {
         cable.disconnect();
       }
+      await messageCallback(message);
     });
 
     cable.subscribe(channel);

The real fix

The monolith should ensure that the full message is the last message for the given subscription ID.

Related Issues

Resolves #1397 (closed)

How has this been tested?

Reproducing the bug is tricky:

  1. First apply this patch

    diff --git a/src/common/chat/gitlab_chat_api.ts b/src/common/chat/gitlab_chat_api.ts
    index b829c23c..d9550f42 100644
    --- a/src/common/chat/gitlab_chat_api.ts
    +++ b/src/common/chat/gitlab_chat_api.ts
    @@ -9,6 +9,7 @@ import {
       AiCompletionResponseMessageType,
     } from '../api/graphql/ai_completion_response_channel';
     import { extractUserId } from '../platform/gitlab_account';
    +import { log } from '../log';
     
     export const AI_ACTIONS = {
       chat: gql`
    @@ -175,8 +176,12 @@ export class GitLabChatApi {
     
         const cable = await platform.connectToCable();
     
    -    channel.on('newChunk', messageCallback);
    +    channel.on('newChunk', async msg => {
    +      log.info(`CHAT-DEBUG: new chunk - ${msg.content}`);
    +      await messageCallback(msg);
    +    });
         channel.on('fullMessage', async message => {
    +      log.info(`CHAT-DEBUG: full message - ${message.content}`);
           await messageCallback(message);
     
           if (subscriptionId) {
    diff --git a/src/common/chat/gitlab_chat_controller.ts b/src/common/chat/gitlab_chat_controller.ts
    index 54df1711..d70e1865 100644
    --- a/src/common/chat/gitlab_chat_controller.ts
    +++ b/src/common/chat/gitlab_chat_controller.ts
    @@ -66,6 +66,12 @@ export class GitLabChatController implements vscode.WebviewViewProvider {
                 // messages yet. Will be handled aas part of https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/issues/1065
                 // Hence for now we handle it on the client
                 await this.#view.cleanChat();
    +            const record = GitLabChatRecord.buildWithContext({
    +              role: 'user',
    +              content: 'how do I install an npm dependency',
    +            });
    +
    +            await this.processNewUserRecord(record);
               }
             } catch (err) {
               log.error(err.toString());
  2. Then open the development extension host and run the /clear command until you either see the empty response (a chunk overrode the full message) or you run out of your daily chat quota and get rate limited

    (image: "hide the pain" meme)

Once you reproduce the bug, you know as well as I do that the order of the "chunk" vs "fullMessage" events is the culprit. You can then simulate the same situation by adding a timeout to the chunk processing. Apply this patch on top of this MR and watch the new condition ignore the delayed chunks:

diff --git a/src/common/chat/gitlab_chat_api.ts b/src/common/chat/gitlab_chat_api.ts
index 0157ffd0..bdb1a029 100644
--- a/src/common/chat/gitlab_chat_api.ts
+++ b/src/common/chat/gitlab_chat_api.ts
@@ -10,6 +10,7 @@ import {
 } from '../api/graphql/ai_completion_response_channel';
 import { extractUserId } from '../platform/gitlab_account';
 import { log } from '../log';
+import { waitForMs } from '../utils/wait_for_ms';
 
 export const AI_ACTIONS = {
   chat: gql`
@@ -181,6 +182,7 @@ export class GitLabChatApi {
     let fullMessageReceived = false;
 
     channel.on('newChunk', async msg => {
+      await waitForMs(300);
       if (fullMessageReceived) {
         log.info(`CHAT-DEBUG: full message received, ignoring chunk`);
         return;

and see the extension logs:

2024-07-08T14:15:55:613 [info]: CHAT-DEBUG: full message received, ignoring chunk
2024-07-08T14:15:55:613 [info]: CHAT-DEBUG: full message received, ignoring chunk
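
The timeout patch imports waitForMs from ../utils/wait_for_ms. If that helper doesn't exist in your checkout, a minimal sketch would be:

// Resolve after the given number of milliseconds.
export const waitForMs = (ms: number): Promise<void> =>
  new Promise(resolve => {
    setTimeout(resolve, ms);
  });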

Screenshots (if appropriate)

What CHANGELOG entry will this MR create?

  • fix: Bug fix - fixes a user-facing issue in production - included in changelog
  • feature: New feature - a user-facing change which adds functionality - included in changelog
  • BREAKING CHANGE: (fix or feature that would cause existing functionality to change) - should bump major version, mentioned in the changelog
  • None - other non-user-facing changes