New Features:
- Support writing results to local files
- New config field (`output_sinks`) to write results to multiple locations
- Added the ability to compare against ground truth for code generation datasets

Optimization:
- Reduced the number of text-embedding API calls by 50% in the similarity score metric

Misc:
- Added test coverage report in CI
- Improved logging
- Upgraded Python to 3.11.8
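
To illustrate the `output_sinks` field mentioned under New Features, here is a minimal sketch of how a list of sinks might fan one set of results out to several local files. The config shape (`type` and `path` keys, a `local_file` sink type) and the `write_results` helper are assumptions made for illustration only; the actual schema and implementation may differ.

```python
import json
from pathlib import Path


def write_results(results, output_sinks):
    """Write the same results to every configured sink and return the paths written.

    Hypothetical helper: `output_sinks` is assumed to be a list of dicts
    with "type" and "path" keys. Only an assumed "local_file" type is handled.
    """
    written = []
    for sink in output_sinks:
        if sink["type"] != "local_file":
            raise ValueError(f"unsupported sink type: {sink['type']}")
        path = Path(sink["path"])
        # Create missing parent directories so nested output paths work.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
        written.append(path)
    return written
```

Under these assumptions, passing two `local_file` entries would produce two identical JSON files, one per configured location.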