It would be nice to know the rough number of expected tokens in and out, to estimate the potential cost if I were to run it myself, especially for the hard benchmark, as this doesn't seem to be detailed in your paper.
In the paper you mention that BigCodeBench-Complete has roughly 1,112.5 chars per prompt and 426 chars per answer across 1,140 questions. That's enough for a very rough cost estimate.
This roughly gives, for Complete: 1140 * (1112.5 + 426) = 1,753,890 chars * 0.75 = 1,315,418 tokens.
Or for Instruct: 1140 * (663.2 + 426) = 1,241,688 chars * 0.75 = 931,266 tokens.
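For reference, here is the estimate above as a small Python sketch. The 0.75 chars-to-tokens factor is my own rough assumption (actual tokenizer ratios vary by model), and the per-question character averages are the ones reported in the paper:

```python
def estimate_tokens(avg_prompt_chars, avg_answer_chars,
                    n_tasks=1140, tokens_per_char=0.75):
    """Very rough total-token estimate for a full benchmark run.

    tokens_per_char is an assumed conversion factor, not an official figure.
    """
    total_chars = n_tasks * (avg_prompt_chars + avg_answer_chars)
    return round(total_chars * tokens_per_char)

# BigCodeBench-Complete: ~1,112.5 prompt chars + ~426 answer chars per task
print(estimate_tokens(1112.5, 426))  # 1315418

# BigCodeBench-Instruct: ~663.2 prompt chars + ~426 answer chars per task
print(estimate_tokens(663.2, 426))   # 931266
```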
Could we potentially get similar figures for the hard benchmark dataset?