Refactor embeddings

Alexandru Croitor requested to merge embeddings_refactoring into master

What does this MR do and why?

This refactoring is part of the effort to add Vertex AI embeddings support to GitLab documentation embeddings; see Draft: [PoC] Create embeddings with vertex ai (!129864 - closed).

This is a light refactoring that moves a couple of classes into more relevant modules following our experimentation, and prepares some of the utility classes for other uses, e.g. creating embeddings for bigger chunks of text by passing the text size limits as parameters.

Main changes:

  • ee/lib/gitlab/llm/embeddings/utils/base_content_parser.rb
    • code was moved from ee/lib/gitlab/llm/content_parser.rb
    • class methods changed into instance methods
    • initializer added so that we can pass in different parameters for parsing and splitting content chunks for embedding
  • ee/lib/gitlab/llm/embeddings/utils/docs_content_parser.rb
    • added DocsContentParser as a specific version of content parser for documentation.
    • DocsContentParser follows the same limits on min and max characters as the original ContentParser
  • Having BaseContentParser should help when we want to split issue content for embeddings, for instance if we decide we want different limits for splitting that content.
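
To illustrate the shape of the change described above, here is a minimal sketch, not the actual code in this MR. It assumes a paragraph-based splitting strategy; the method name `split_by_size`, the parameter names, and the limit values (100 and 1500) are placeholders chosen for the example and may not match the real classes.

```ruby
# Hypothetical sketch: a base parser whose size limits are passed to the
# initializer, plus a docs-specific subclass with fixed defaults.
class BaseContentParser
  attr_reader :min_chars_per_embedding, :max_chars_per_embedding

  def initialize(min_chars_per_embedding, max_chars_per_embedding)
    @min_chars_per_embedding = min_chars_per_embedding
    @max_chars_per_embedding = max_chars_per_embedding
  end

  # Groups paragraphs into chunks no longer than the max limit and
  # drops chunks shorter than the min limit.
  def split_by_size(text)
    chunks = []
    current = +""

    text.split(/\n{2,}/).each do |paragraph|
      paragraph = paragraph.strip
      next if paragraph.empty?

      # A paragraph longer than the limit is hard-split into pieces.
      paragraph.scan(/.{1,#{max_chars_per_embedding}}/m).each do |piece|
        if current.empty?
          current = piece.dup
        elsif current.length + 2 + piece.length <= max_chars_per_embedding
          current << "\n\n" << piece
        else
          chunks << current
          current = piece.dup
        end
      end
    end

    chunks << current unless current.empty?
    chunks.select { |chunk| chunk.length >= min_chars_per_embedding }
  end
end

# Docs-specific parser; the default limits here are placeholder values.
class DocsContentParser < BaseContentParser
  MIN_CHARS_PER_EMBEDDING = 100
  MAX_CHARS_PER_EMBEDDING = 1500

  def initialize(min_chars = MIN_CHARS_PER_EMBEDDING, max_chars = MAX_CHARS_PER_EMBEDDING)
    super(min_chars, max_chars)
  end
end
```

With this structure, a parser for issue content could reuse the same splitting logic with different limits, for example `BaseContentParser.new(50, 4000).split_by_size(issue_description)`, where both the limits and `issue_description` are again hypothetical.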

re https://gitlab.com/gitlab-org/gitlab/-/issues/420939

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Alexandru Croitor