WIP: [PoC] Fix cobertura paths in the background job
This is the PoC work for #273540 (closed)
This is a proposal on how we could fix the cobertura report's file paths and also support multiple sources.
How it works
The parser will try to extract the right portion after the project's fullpath from each source
node.
For example, if the project full_path
is foo/bar
<sources>
<source>/builds/foo/bar/MyLib1</source>
<source>/builds/foo/bar/Some/foo/bar/here</source>
</sources>
We will extract ["MyLib1", "Some/foo/bar/here"]
and use these as basis to detemine the actual file path of each class relative to the project root, combining extracted source and the class filename.
Let's say we have a class with the filename=User.cs
, and the worktree paths look like ["MyLib1/User.cs", "Some/foo/bar/here/User.cs"]
, we will take the first path that matches which is Mylib1/User.cs
.
You may also notice that we have to delete
the matched class paths from the worktree context[:paths]
after we finish parsing a package's collection of classes. This is best-effort assumption to avoid conflicts with class filenames that have duplicates across multiple packages and sources.
I added specs to make sure the parser is properly handling the edge cases in this PoC.
I got the coverage MR diffs to work locally on a C# and Java project.
What about Go support?
Given the Go issue seems to require its own special parser logic, for now it will fallback to ignoring sources
and will require Go users to do the sed
workaround. I propose that we have a separate issue for handling Go cobertura format.
Some benchmark results
pipeline.all_worktree_paths
performance on GitLab project on staging
From testing on staging, loading all_worktree_paths
on a large project like ours seems to have no performance implications.
More info
pipeline = Ci::Pipeline.find 12878474
Benchmark.bmbm do |x|
x.report("fetch gitlab worktree") { pipeline.all_worktree_paths }
end
Rehearsal ---------------------------------------------------------
fetch gitlab worktree 0.525779 0.034583 0.560362 ( 1.108381)
------------------------------------------------ total: 0.560362sec
user system total real
fetch gitlab worktree 0.000119 0.000032 0.000151 ( 0.000132)
[ gstg ] production> s=pipeline.all_worktree_paths.to_set
[ gstg ] production> s.size
=> 39495
all_worktree_paths
on a project with 500_001
files on staging
Cost of calling
[ gstg ] production> Benchmark.bmbm {|b| b.report('500_001 files') { p.repository.ls_files('master') }; }
Rehearsal -------------------------------------------------
500_001 files 0.529791 0.124296 0.654087 ( 0.716400)
---------------------------------------- total: 0.654087sec
user system total real
500_001 files 0.470483 0.028320 0.498803 ( 0.630161)
[ gstg ] production> p.repository.ls_files('master').size
=> 500001
Comparing performance between old and new implementation
The added parser behavior in this proposal, which is cross checking between sources and worktree paths,
will only be triggered when there are any extracted parts after the first project.full_path
occurrence. So
for current cobertura reports that don't have any issues, for example, RSpec, there shouldn't be any significant
change on performance.
More info
xml_data = File.read("/Users/kratos/Documents/GitLab/test/cobertura/cobertura-coverage-rspec-gitlab.xml");
Benchmark.bmbm do |x|
x.report("old parser") { Gitlab::Ci::Parsers::Coverage::CoberturaOld.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new) }
x.report("new parser") { Gitlab::Ci::Parsers::Coverage::Cobertura.new.parse!(xml_data, Gitlab::Ci::Reports::CoverageReports.new, project_path: 'gitlab-org/gitlab', worktree_paths: []) }
end
Rehearsal ----------------------------------------------
old parser 5.591941 1.136074 6.728015 ( 6.756588)
new parser 5.537035 0.176511 5.713546 ( 5.718334)
------------------------------------ total: 12.441561sec
user system total real
old parser 5.999207 0.148476 6.147683 ( 6.149939)
new parser 6.331641 0.197576 6.529217 ( 6.537377)
A bit weird here. The there seems to be more time taken when parsing the same XML file on the 2nd run (new parser). Just to prove it's not because of the added code in the parser:
[7] pry(main)> Benchmark.bmbm {|x| x.report('old') { Hash.from_xml(xml_data)}; x.report('new') { Hash.from_xml(xml_data)}; }
Rehearsal ---------------------------------------
old 5.123571 0.145755 5.269326 ( 5.272794)
new 5.531147 0.186506 5.717653 ( 5.719541)
----------------------------- total: 10.986979sec
user system total real
old 5.048480 0.142026 5.190506 ( 5.192629)
new 6.089451 0.130987 6.220438 ( 6.222050)
The possible cause for concern in this implementation is when there are too many sources
, assuming all those sources have extractable parts, because we have to iterate through each source
for each class
node and cross-check the combined path to the given worktree paths.
Here's an example for a single source and 10_000
classes. As you will see, the performance difference between old and new is insignificant.
More info
Rehearsal ----------------------------------------------
old parser 0.574016 0.044187 0.618203 ( 0.619036)
new parser 0.566651 0.022968 0.589619 ( 0.589773)
------------------------------------- total: 1.207822sec
user system total real
old parser 0.382012 0.002439 0.384451 ( 0.384550)
new parser 0.458621 0.002361 0.460982 ( 0.461330)
Now here's an example for 100
extracted sources and 10_000
classes. This is where the parsing time is significantly slower with the new implementation. I am using 10_000
because this is close to the number of classes we currently have in our RSpec cobertura report which is around 8_000
.
More info
Rehearsal ----------------------------------------------
old parser 0.497342 0.006429 0.503771 ( 0.503547)
new parser 0.723245 0.011851 0.735096 ( 0.734848)
------------------------------------- total: 1.238867sec
user system total real
old parser 0.352771 0.000223 0.352994 ( 0.352829)
new parser 0.813122 0.018858 0.831980 ( 0.832343)
This is why in this PoC, I am proposing to add a MAX_SOURCES
limit on the number of extracted sources we can iterate through. Beyond the limit, if there is still no match for the given file path, the class will not be included in the report.
My final thoughts and opinions
My personal opinion about the sources
concern is that we may not actually realistically get projects with that huge amount of sources
, but having a limit on MAX_SOURCES
will at least help, if ever.
I think the parsing logic that we do here (comparing between extracted sources and worktree paths) will have to be done the same if ever we decide to go the CLI tool route. But I am worried it might be less performant with the CLI tool because then we will have to be actually checking if the file exists, which is slower than Set#include?
.
I also think that it's better that we put the burden of doing this extra work on our background jobs instead of consuming more user CI minutes.
Looking forward to hear your thoughts and suggestions.