Dynamic Model Fan Out
Problem to solve
Our pipeline currently runs on Apache Beam on top of Dataflow, and we call all of our models in a single step (see the step labeled "Request Vertex Completions" in figure 1). This prevents us from taking advantage of the parallelism that Beam offers.
```mermaid
graph TD
    subgraph fig1["Figure 1: Current State"]
        readCode(Read Code)
        filterCode(Filter Code)
        chunkCode(Chunk Code)
        applyPromptTransformations(Apply Prompt Transformations)
        batchCodeChunks(Batch Code Chunks)
        throttleCompletion(Throttle Completions API Call)
        requestVertexCompletion(Request Vertex Completions)
        batchCompletions(Batch Code Completions)
        throttleEmbedding(Throttle Embedding API Call)
        requestEmbeddingSimilarity(Compute Similarity)
        postProcessCompletions(Post Process Completions)
        sendDataToBigQuery(Send Data to BigQuery)
        readCode --> filterCode --> chunkCode --> applyPromptTransformations --> batchCodeChunks --> throttleCompletion --> requestVertexCompletion --> postProcessCompletions --> batchCompletions --> throttleEmbedding --> requestEmbeddingSimilarity --> sendDataToBigQuery
    end
```
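In effect, the single "Request Vertex Completions" step loops over every configured model inside one pipeline step, so each chunk waits on every model in turn. A minimal sketch of that shape (the model names and the `request_completion` callables are illustrative placeholders, not the real prompt-library API):

```python
# Simplified illustration of the current single-step dispatch:
# one function calls every model in turn, so per-model work is serialized
# inside a single pipeline step.
def request_all_completions(chunk, models):
    """Call each model for one code chunk inside one step."""
    results = {}
    for name, request_completion in models.items():
        # Each call blocks until the previous model has responded.
        results[name] = request_completion(chunk)
    return results

# Hypothetical stand-ins for real model clients.
models = {
    "code-gecko": lambda chunk: f"gecko:{chunk}",
    "text-bison": lambda chunk: f"bison:{chunk}",
}

completions = request_all_completions("def add(a, b):", models)
```

Because the loop is sequential, total latency per chunk is roughly the sum of all model latencies, and Beam has no opportunity to schedule the calls on separate workers.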
While adding Hugging Face models, a method of separating the Hugging Face calls from the Vertex calls was derived (see figure 2). However, this doesn't fully leverage the parallelism gains that Beam offers, and it doesn't meet the goal of making models from different platforms equal citizens within the prompt-library. As such, this proposal was drafted.
```mermaid
graph TD
    subgraph fig2["Figure 2: Hugging Face and Vertex Split"]
        readCode(Read Code)
        filterCode(Filter Code)
        chunkCode(Chunk Code)
        applyPromptTransformations(Apply Prompt Transformations)
        batchCodeChunks(Batch Code Chunks)
        throttleCompletion(Throttle Completions API Call)
        requestVertexCompletion(Request Vertex Completions)
        requestHuggingFaceCompletion(Request HuggingFace Completions)
        batchCompletions(Batch Code Completions)
        throttleEmbedding(Throttle Embedding API Call)
        requestEmbeddingSimilarity(Compute Similarity)
        mergeResults(Merge Completions)
        postProcessCompletions(Post Process Completions)
        sendDataToBigQuery(Send Data to BigQuery)
        readCode --> filterCode --> chunkCode --> applyPromptTransformations --> batchCodeChunks --> throttleCompletion
        throttleCompletion --> requestVertexCompletion --> mergeResults
        throttleCompletion --> requestHuggingFaceCompletion --> mergeResults
        mergeResults --> postProcessCompletions --> batchCompletions --> throttleEmbedding --> requestEmbeddingSimilarity --> sendDataToBigQuery
    end
```
Proposal
@HongtaoYang and @tle_gitlab proposed, and I agree, that we should fan those calls out across all the models: instead of calling all the models in one step, or dividing the calling steps by platform, we call each model in its own parallel step. This aligns with the work being done by @HongtaoYang to make Vertex, Hugging Face, and Anthropic models equal citizens in our architecture, allows for faster processing times (read: improved scale), and, when the two initiatives come together, will make adding models faster in the future. For a visual, see figure 3 below.
```mermaid
graph TD
    subgraph fig3["Figure 3: Proposed Idea"]
        readCode(Read Code)
        filterCode(Filter Code)
        chunkCode(Chunk Code)
        applyPromptTransformations(Apply Prompt Transformations)
        codeGeckoBatchCodeChunks("code-gecko: Batch Code Chunks")
        codeGeckoThrottleCompletion("code-gecko: Throttle Completions API Call")
        codeGeckoRequestCompletion("code-gecko: Request Completions")
        textBisonBatchCodeChunks("text-bison: Batch Code Chunks")
        textBisonThrottleCompletion("text-bison: Throttle Completions API Call")
        textBisonRequestCompletion("text-bison: Request Completions")
        codeLlama13bBatchCodeChunks("CodeLlama13b: Batch Code Chunks")
        codeLlama13bThrottleCompletion("CodeLlama13b: Throttle Completions API Call")
        codeLlama13bRequestCompletion("CodeLlama13b: Request Completions")
        phindCodeLlamaBatchCodeChunks("Phind-CodeLlama-34B-v2: Batch Code Chunks")
        phindCodeLlamaThrottleCompletion("Phind-CodeLlama-34B-v2: Throttle Completions API Call")
        phindCodeLlamaRequestCompletion("Phind-CodeLlama-34B-v2: Request Completions")
        mergeResults(Merge Completions)
        postProcessCompletions(Post Process Completions)
        batchCompletions(Batch Code Completions)
        throttleEmbedding(Throttle Embedding API Call)
        requestEmbeddingSimilarity(Compute Similarity)
        sendDataToBigQuery(Send Data to BigQuery)
        readCode --> filterCode --> chunkCode --> applyPromptTransformations
        applyPromptTransformations --> codeGeckoBatchCodeChunks --> codeGeckoThrottleCompletion --> codeGeckoRequestCompletion --> mergeResults
        applyPromptTransformations --> textBisonBatchCodeChunks --> textBisonThrottleCompletion --> textBisonRequestCompletion --> mergeResults
        applyPromptTransformations --> codeLlama13bBatchCodeChunks --> codeLlama13bThrottleCompletion --> codeLlama13bRequestCompletion --> mergeResults
        applyPromptTransformations --> phindCodeLlamaBatchCodeChunks --> phindCodeLlamaThrottleCompletion --> phindCodeLlamaRequestCompletion --> mergeResults
        mergeResults --> postProcessCompletions --> batchCompletions --> throttleEmbedding --> requestEmbeddingSimilarity --> sendDataToBigQuery
    end
```