My team at Scott Logic wanted to see how well large language models (LLMs) could perform on a fully automated coding task. We decided to try using an LLM to generate JavaScript code displaying a chart in a web page using the D3 library.

There are many examples of D3 code available on the internet so we expected LLMs had been trained with a lot of this data and might be able to generate correct code for D3 charts as a result. But the task we had in mind was more complex than the short, well-defined and self-contained coding problems that LLMs are normally evaluated against. No one on the team had much experience with LLMs, but through experimentation we soon learned more about what to expect. We found that LLMs could often succeed in the task, but only if we provided them with the right kind of help.

The chart generation system

Our aim was to use an LLM to produce a D3 chart based on a “user request” explaining what information the chart should display. We wanted the generated code to run automatically within a simple web page, so we needed the code to be right the first time without additional help from the user.

The system we set up for generating and displaying D3 charts uses Python to interact with the LLMs we use, and to create a JavaScript file from the output of the LLM. The resulting JavaScript file is dropped into a simple website, with a single page that loads the JavaScript file and displays the chart it produces.

A variety of Python libraries are available for interacting with LLMs. OpenAI, Hugging Face and GPT4All each have their own libraries designed to work with the models they provide. We used LangChain to give a common interface to all those underlying libraries, allowing us to use a lot of the same code with a wide variety of LLMs.

The page loads the latest version of the D3 library, which defines a global variable d3. Then the page executes the JavaScript file containing the code generated by the LLM. If all goes well, the generated code displays a chart.

Initial results

Of the LLMs we tried, only three (all from OpenAI) gave us useful results: text-davinci-003, gpt-3.5-turbo and gpt-4. Some other LLMs we tried were text-curie-001 from OpenAI, StarCoder from BigCode, replit-code-v1-3b from Replit and Nous-Hermes-13b from Nous Research.[1]

To get a basic idea of the capabilities of different LLMs, let’s look at what they output for this simple prompt:

Write JavaScript code that uses the d3 library to produce a bar chart showing the number of
races won per driver in the 2022 F1 season.

text-davinci-003

text-davinci-003 is part of OpenAI’s GPT-3 model family. It has been fine-tuned on an instruction-following dataset. It was trained on data up to June 2021.

attempt 1

On the first attempt with our simple prompt, text-davinci-003 generated this output:

// Data array for the 2022 F1 season
var data = [
  { driver: "Lewis Hamilton", racesWon: 12 },
  { driver: "Max Verstappen", racesWon: 8 },
  { driver: "Valtteri Bottas", racesWon: 4 },
  { driver: "Charles Leclerc", racesWon: 3 },
  { driver: "Sebastian Vettel", racesWon: 2 },
];

// Set the dimensions of the canvas / graph
var margin = { top: 30, right: 20, bottom: 30, left: 50 },
  width = 600 - margin.left - margin.right,
  height = 300 - margin.top - margin.bottom;

// Set the ranges
var x = d3.scaleBand().range([0, width]).padding(0.1);
var y = d3.scaleLinear().range([height, 0]);

// Append the SVG object to the body of the page
// Append a 'group' element to 'svg'
// Moves the 'group' element to the top left margin
var svg = d3
  .select("body")
  .append("svg")
  .attr("width", width + margin.left + margin.right)
  .attr("height", height + margin.top + margin.bottom)
  .append("g")
  .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Scale the range of the data in the domains
x.domain(
  data.map(function (d) {
    return d.driver;
  }),
);
y.domain([
  0,
  d3.max(data, function (d) {
    return d.racesWon;
  }),
]);

// Append the rectangles for the bar chart
svg
  .selectAll(".bar")
  .data(data)
  .enter()
  .append("rect")
  .attr("class", "bar")
  .attr("x", function (d) {
    return x(d.driver);
  })
  .attr("width", x.bandwidth())
  .attr("y", function (d) {
    return y(d.racesWon);
  })
  .attr("height", function (d) {
    return height - y(d.racesWon);
  });

// Add the x Axis
svg
  .append("g")
  .attr("transform", "translate(0," + height + ")")
  .call(d3.axisBottom(x));

// Add the y Axis
svg.append("g").call(d3.axisLeft(y));

The code sets up some fake but realistic data, adds a new <svg> element to the page DOM and correctly renders a bar chart within the new element.

When executed in our chart web page, the code produces this chart:

[Image: bar chart with five bars, the largest of which is for "Lewis Hamilton" reaching a value of 12]

This was the best chart produced by any model in a few attempts with our simple prompt. However, text-davinci-003 didn’t always do this well.

attempt 2

The code from the second attempt calls a D3 function to load a CSV file that doesn’t exist, and so it fails with a runtime error.

// Load data
  d3.csv("2022F1SeasonData.csv", function(data) {
~~~ etc.

attempt 3

The third attempt generated code that sets up fake data, as in attempt 1. However, this time the chart it produces doesn’t have an x-axis or y-axis because they have been cropped out of the visible frame. Unlike the first two examples from text-davinci-003, the generated code in this example doesn’t establish a margin around the main part of the chart.

[Image: bar chart with four bars with no space between them, no labels of any kind, and no line for the x-axis]
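
The role of the margin can be shown with a small sketch. The helper name chartLayout is hypothetical and does not come from any generated response; it only reproduces the arithmetic that the first text-davinci-003 attempt performs before creating the <svg> element.

```javascript
// Hypothetical helper: computes the inner plotting area and the translate()
// string that shifts the chart group away from the edges of the <svg>.
function chartLayout(outerWidth, outerHeight, margin) {
  return {
    width: outerWidth - margin.left - margin.right,
    height: outerHeight - margin.top - margin.bottom,
    transform: "translate(" + margin.left + "," + margin.top + ")",
  };
}

// Same numbers as the first text-davinci-003 attempt.
const layout = chartLayout(600, 300, { top: 30, right: 20, bottom: 30, left: 50 });
console.log(layout.width, layout.height, layout.transform);
```

When every margin is zero the transform becomes translate(0,0), so the tick labels that d3.axisLeft draws to the left of the axis line fall at negative x coordinates and an axis placed at the bottom of the plotting area has no room below it. That is consistent with the cropped axes seen in this attempt.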

gpt-4

Another LLM that gives some good results with our simple prompt is OpenAI’s latest model, gpt-4.

attempt 1

On the first attempt gpt-4 generated the following output in response to our simple prompt above:

Here's a basic example of how you could create a bar chart using D3.js. Please note that this
is a simplified example and does not include all the steps you might need to take in a
real-world application, such as fetching the data from an API or handling errors.

```javascript
// Assuming we have data in the following format
const data = [
  { driver: 'Driver 1', racesWon: 5 },
  { driver: 'Driver 2', racesWon: 3 },
  { driver: 'Driver 3', racesWon: 7 },
  // ... more drivers
];

// Set the dimensions and margins of the graph
const margin = { top: 20, right: 20, bottom: 30, left: 40 },
  width = 960 - margin.left - margin.right,
  height = 500 - margin.top - margin.bottom;

// Set the ranges
const x = d3.scaleBand()
  .range([0, width])
  .padding(0.1);
const y = d3.scaleLinear()
  .range([height, 0]);

// Append the svg object to the body of the page
const svg = d3.select("body").append("svg")
  .attr("width", width + margin.left + margin.right)
  .attr("height", height + margin.top + margin.bottom)
  .append("g")
  .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

// Scale the range of the data in the domains
x.domain(data.map(function (d) { return d.driver; }));
y.domain([0, d3.max(data, function (d) { return d.racesWon; })]);

// Append the rectangles for the bar chart
svg.selectAll(".bar")
  .data(data)
  .enter().append("rect")
  .attr("class", "bar")
  .attr("x", function (d) { return x(d.driver); })
  .attr("width", x.bandwidth())
  .attr("y", function (d) { return y(d.racesWon); })
  .attr("height", function (d) { return height - y(d.racesWon); });

// Add the x Axis
svg.append("g")
  .attr("transform", "translate(0," + height + ")")
  .call(d3.axisBottom(x));

// Add the y Axis
svg.append("g")
  .call(d3.axisLeft(y));
```

In this code, we first define some data, then set up the dimensions and margins for the graph.
We create scales for the x and y axes, then append a new SVG element to the body of the page.

We then scale the range of the data in the domains and append rectangles for each data point to
create the bar chart.

Finally, we add the x and y axes to the SVG.

Please note that you would need to include D3.js library in your HTML file to run this script.

This output is similar to what we saw in the first text-davinci-003 attempt above, but there is some natural language commentary before and after the generated code. If the JavaScript code is extracted from the output (which is easily done because it is wrapped in a markdown code block), this is the chart it displays in our chart web page:

[Image: bar chart with three bars labeled "Driver 1", "Driver 2" and "Driver 3", and with value labels on the y-axis ranging from "0.0" to "7.0" in increments of 0.5]

The data is displayed correctly, but the y-axis shows a tick label every 0.5, even though the number of races won can only be a whole number.
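
One way to avoid fractional tick labels, sketched here on the assumption that the counts are small non-negative integers, is to give the axis an explicit list of whole-number ticks. wholeNumberTicks is a hypothetical helper. tickValues is part of the D3 axis API, but none of the generated code discussed here used it.

```javascript
// Hypothetical helper: lists every whole number from 0 up to maxValue.
function wholeNumberTicks(maxValue) {
  const ticks = [];
  for (let value = 0; value <= Math.floor(maxValue); value++) {
    ticks.push(value);
  }
  return ticks;
}

// The largest value in the gpt-4 sample data is 7.
console.log(wholeNumberTicks(7));

// In the chart page, where d3 is loaded, the list could be passed to the axis:
//   svg.append("g").call(d3.axisLeft(y).tickValues(wholeNumberTicks(7)));
```

This only suits small ranges, because one tick is drawn per integer.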

attempt 2

Another attempt with gpt-4 gave essentially the same code as the first example, with some variation in the natural language commentary. The output starts like this:

I'm sorry, but as of now, I can't provide the accurate JavaScript code to produce a bar chart
showing the number of races won per driver in the 2022 F1 season. The reason being, the 2022 F1
season is still ongoing and data regarding the races won by each driver is not yet complete or
available.

However, I can provide a general idea of how you can create a bar chart using d3.js. You'll
need to replace the data with the actual data when it becomes available.

```javascript
// sample data, replace with actual data when available
var data = [
  {driver: 'Driver 1', racesWon: 5},
  {driver: 'Driver 2', racesWon: 3},
  {driver: 'Driver 3', racesWon: 6},
  // ...
];
~~~ etc.
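
Responses like the two above mix commentary with a fenced code block, so the code has to be separated out before it can be written to a file. Our system did this step in Python; the following is only an illustrative JavaScript sketch, and the name extractJavaScript and its fallback behaviour are assumptions.

```javascript
// Built at runtime so that this example does not itself contain a code fence.
const FENCE = "`".repeat(3);

// Matches an opening fence with an optional "javascript" or "js" label and
// captures everything up to the next closing fence.
const FENCED_BLOCK = new RegExp(
  FENCE + "(?:javascript|js)?[ \\t]*\\r?\\n([\\s\\S]*?)" + FENCE
);

// Hypothetical helper: returns the first fenced block, or the whole response
// when there is no fence (as with the code-only text-davinci-003 output).
function extractJavaScript(response) {
  const match = response.match(FENCED_BLOCK);
  return (match ? match[1] : response).trim();
}

const reply =
  "Here's a basic example.\n\n" +
  FENCE + "javascript\nconst data = [];\n" + FENCE +
  "\n\nPlease note that you would need to include D3.js.";
console.log(extractJavaScript(reply));
```

The sketch does not handle responses that wrap the code in HTML, such as the third gpt-3.5-turbo attempt described below.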

attempt 3

The code generated in the third attempt produces a chart with the x-axis and y-axis cropped out of the displayed frame, like in the third attempt with text-davinci-003. The cause is the same – no margin is set up around the main part of the chart.

gpt-3.5-turbo

OpenAI’s gpt-3.5-turbo model, from the GPT-3.5 model family, gives results similar to those from gpt-4.

attempt 1

On the first attempt, gpt-3.5-turbo generated a response with natural language commentary and a JavaScript code block. The JavaScript code sets up hard-coded data and displays a chart which is missing x and y axes because they are rendered outside the displayed area of the <svg> element. Once again the code does not establish a margin around the main body of the chart. The displayed chart also has a problem where the labels for the bars overlap under the x-axis, making them mostly illegible.

[Image: bar chart with five bars with labels that overlap each other on the x-axis, and with no axis lines or other labels]

attempt 2

The second attempt with gpt-3.5-turbo again generated a response with natural language commentary and a JavaScript code block. The JavaScript produces a runtime error trying to fetch a non-existent data file from the web.

attempt 3

A final attempt with gpt-3.5-turbo generated a response with natural language commentary and HTML which includes JavaScript in a <script> element. This is not the requested format, but if the HTML with its embedded JavaScript is loaded, it displays a good bar chart:

[Image: bar chart with four bars labeled with driver names on the x-axis, and with clear labels for values on the y-axis]

However, within our automated system that tries to use the output in a simple web page, the JavaScript produces a runtime error because the web page does not have the HTML element the code is expecting.

Other models

Other LLMs we tried didn’t generate useful JavaScript code. Some generated JavaScript with basic syntax errors or with syntax that is only valid in other programming languages. In other cases the generated JavaScript was syntactically valid but didn’t do anything useful. text-curie-001 generated JavaScript that produces runtime errors because it calls functions that don’t exist or because it passes invalid arguments to D3 functions. StarCoder generated Python code rather than the JavaScript that was requested. Although the Python code wasn’t tested, it looked as though it might produce a chart. Often the less successful LLMs started repeating the same line or group of lines and continued repeating them to the end of the response. Nous-Hermes-13b responded with a natural language description of some data that might be used in generating the requested chart, but without any JavaScript code.

Providing the CSV data

Clearly, a chart that displays fake data isn’t going to be very useful in a real-world system. We decided to provide relevant data in comma-separated value (CSV) format, which the generated JavaScript could load from the web server. We can modify our prompt to give the LLM information needed to generate JavaScript that makes use of a CSV file. The new prompt includes the location of the file and the names of the data fields it contains:

Write JavaScript code that uses the d3 library to produce a bar chart showing the number of
races won per driver in the 2022 F1 season.

Load the data from a csv file located at 'data-sources/2022_f1.csv'.

The columns in the csv file are: Circuit,Date,1st Place Driver,1st Place Team,2nd Place Driver,
2nd Place Team,3rd Place Driver,3rd Place Team,Laps,Time,Fastest Stop Team,Fastest Stop Driver,
Fastest Stop Time.
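
A prompt of this shape can be assembled from the header line of the CSV file. Our system built its prompts in Python, so the JavaScript below is only a sketch: the function name buildCsvPrompt is hypothetical, and the plain comma split assumes that no column name contains a quoted comma.

```javascript
// Hypothetical helper: builds a prompt that names the CSV file and its columns.
function buildCsvPrompt(request, csvPath, headerLine) {
  // Naive split; a real CSV parser would be needed for quoted column names.
  const columns = headerLine.trim().split(",");
  return [
    "Write JavaScript code that uses the d3 library to produce " + request + ".",
    "",
    "Load the data from a csv file located at '" + csvPath + "'.",
    "",
    "The columns in the csv file are: " + columns.join(",") + ".",
  ].join("\n");
}

console.log(
  buildCsvPrompt(
    "a bar chart showing the number of races won per driver in the 2022 F1 season",
    "data-sources/2022_f1.csv",
    "Circuit,Date,1st Place Driver,1st Place Team\n"
  )
);
```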

gpt-3.5-turbo

Out of five attempts with gpt-3.5-turbo using the updated prompt, three successfully displayed a chart in the web page. The best chart was this one:

[Image: bar chart with the x-axis labeled "Driver" and the y-axis labeled "Number of Wins", containing five blue bars labeled with driver names, and with clear labels for values on the y-axis]

The gpt-3.5-turbo response that produced the chart starts like this:

Here is the JavaScript code that uses the d3 library to produce a bar chart showing the number
of races won per driver in the 2022 F1 season:

```javascript
// Load the data from the CSV file
d3.csv('data-sources/2022_f1.csv').then(function(data) {
  // Count the number of races won per driver
  var driverWins = {};
  data.forEach(function(d) {
    var winner = d['1st Place Driver'];
    if (driverWins[winner]) {
      driverWins[winner]++;
    } else {
      driverWins[winner] = 1;
    }
  });
~~~ etc.

On this attempt, the model correctly uses the D3 csv function to load the data. It also correctly generates the code needed to count the number of times each driver appears in the 1st Place Driver field. The striking thing about this is that the model has understood the meaning and structure of the data given only the names of the fields. With the first prompt we used, the models set up fake data with fields like driver and racesWon. That data structure doesn’t need any further data transformation. But when the prompt includes the information that each data record contains a field 1st Place Driver, the model “understands” that the code needs to count the number of times each driver appears in that field to get the number of races won. This attempt also produces a chart with labels explaining the meaning of the two axes (“Driver” and “Number of Wins”). It was the only attempt with either of our first two prompts that includes this information.
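
The aggregation step can be tried outside the browser. This sketch applies the same counting logic to the first three races of the data set, written as the plain objects that d3.csv would produce (only the relevant fields are included). The function name countWins is ours, and a Map replaces the plain object used in the generated code.

```javascript
// Counts how often each driver appears in the "1st Place Driver" field and
// returns records in the { driver, racesWon } shape used by the earlier charts.
function countWins(rows) {
  const wins = new Map();
  for (const row of rows) {
    const winner = row["1st Place Driver"];
    wins.set(winner, (wins.get(winner) || 0) + 1);
  }
  return Array.from(wins, ([driver, racesWon]) => ({ driver, racesWon }));
}

const rows = [
  { Circuit: "Bahrain", "1st Place Driver": "Charles Leclerc" },
  { Circuit: "Saudi Arabia", "1st Place Driver": "Max Verstappen" },
  { Circuit: "Australia", "1st Place Driver": "Charles Leclerc" },
];

// Charles Leclerc has 2 wins and Max Verstappen has 1 in these three rows.
console.log(countWins(rows));
```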

Two attempts failed because the generated code calls functions that are only available in earlier versions of the D3 library. One displayed a correct chart when D3 v4 was loaded instead of D3 v7. The other failed attempt had code that calls two D3 functions that are present in different D3 versions, but are not both present in any single version. These failures show that the model has a poor understanding of the differences between versions of D3. This might be expected since code written for different versions of D3 is very similar and most examples the model is trained on probably do not explicitly declare which version of D3 they are using.

gpt-4

Two out of five attempts with gpt-4 were successful. Two others displayed the correct data but had axes or labels that were rendered outside of the visible area of the <svg> element. The one failed attempt generated code that tries to render the chart into an HTML element that does not exist in the chart web page. If a matching element is added to the chart web page HTML, the generated code displays a correct chart.

text-davinci-003

All five attempts with text-davinci-003 failed, each with one or two problems. The problems in the generated code were caused by:

  • Passing an argument to the D3 csv function that is only valid for an earlier version of the library. This occurred in three attempts, and in each case the code worked correctly when D3 v4 was loaded instead of D3 v7.
  • Rendering the chart in a non-existent HTML element. This occurred in two attempts.
  • Including a <script> tag within the JavaScript code.
  • Axes not aligned correctly with the main body of the graph.

Again, we see the problem with understanding which version of D3 to use, which is unsurprising since there is nothing in the prompt to indicate which version should be used. Some quick experiments with adding an instruction to the prompt like “Use D3 version 7” did not make a noticeable improvement, so the problem might be with the training data as much as with the prompt.

The problem with selecting a non-existent HTML element is also unsurprising since the prompt contains no information about the HTML of the page. Experiments showed that adding an instruction like “Select the existing <svg> element of the HTML page for rendering” eliminates this problem for gpt-4, gpt-3.5-turbo and text-davinci-003.

One-shot prompting

One-shot prompting is a well-known technique used to help guide an LLM in producing output that matches user expectations. The basic idea is to provide an example instruction and response within the prompt. The new instruction is then provided and the LLM is left to complete the text by generating a matching response. Typically, the LLM will generate a response in a style similar to the example response in the prompt.

Here is a one-shot prompt for our chart generation task. It includes an example with labels for “INSTRUCTION”, “DATA” and “RESPONSE”, followed by the “INSTRUCTION” and “DATA” sections for the actual user request we are interested in, and a final “RESPONSE” label to show where the LLM should complete the text:

### INSTRUCTION
Write JavaScript code that uses the d3 library to produce a bar chart showing the number of
pupils in each class.

### DATA
Data is in a csv file at 'data-sources/classes.csv'.
The fields in the csv file are: Class,Pupils.
The first three data rows of the csv file are: ['Biology,23','History,28','Latin,5'].

### RESPONSE
d3.csv('data-sources/pupils.csv').then((data) => {
  var svg = d3.select('svg');
  var width = 800;
  var height = 600;
  var margin = {top: 20, right: 20, bottom: 60, left: 80};
  svg.attr("width", width).attr("height", height);
  var xScale = d3.scaleBand().range([margin.left, width - margin.right]).padding(0.1);
  var yScale = d3.scaleLinear().range([height - margin.bottom, margin.top]);

  // find the number of pupils in each class
  // transform or aggregate the data, if necessary
  xScale.domain(data.map((d) => d["Class"]));
  yScale.domain([0, d3.max(data, (d) => d["Pupils"])]);
  svg.selectAll(".bar")
    .data(data)
    .enter()
    .append("rect")
    .attr("class", "bar")
    .attr("x", (d) => xScale(d[0]))
    .attr("y", (d) => yScale(d[1]))
    .attr("width", xScale.bandwidth())
    .attr("height", (d) => height - margin.bottom - yScale(d[1]));
  svg.append("g")
    .attr("transform", "translate(0," + (height - margin.bottom) + ")")
    .call(d3.axisBottom(xScale));
  svg.append("g")
    .attr("transform", "translate(" + margin.left + ",0)")
    .call(d3.axisLeft(yScale));
  svg.append('text')
    .attr('transform', 'translate(' + (width / 2) + ',' + (height - margin.bottom + 50) + ')')
    .style('text-anchor', 'middle')
    .text('Class');
  svg.append('text')
    .attr('transform', 'rotate(-90)')
    .attr('y', 0)
    .attr('x', 0 - (height / 2))
    .attr('dy', '1em')
    .style('text-anchor', 'middle')
    .text('Number of Pupils');
});

### INSTRUCTION:
Write JavaScript code that uses the d3 library to produce a bar chart showing the number of
races won per driver in the 2022 F1 season.

### DATA:
Data is in a csv file at 'data-sources/2022_f1.csv'.
The fields in the csv file are: Circuit,Date,1st Place Driver,2nd Place Driver,
3rd Place Driver,Laps.
The first three data rows of the csv file are: ['Bahrain,20-Mar-22,Charles Leclerc,
Carlos Sainz,Lewis Hamilton,57', 'Saudi Arabia,27-Mar-22,Max Verstappen,Charles Leclerc,
Carlos Sainz,50','Australia,10-Apr-22,Charles Leclerc,Sergio Perez,George Russell,58'].

### RESPONSE:

The one-shot example in this prompt uses a simple request and a simple data structure. We have seen already that the more capable LLMs have the ability to generate code to transform the data to a structure suitable for the request. So the focus of the example in this prompt is on the aspects that relate to the execution environment, like using the correct form of the csv function and rendering the chart in the expected place within the DOM, and also on aspects of the visual presentation of the chart, like using a margin so that the axes are not displayed outside the visible area and adding text explaining the meaning of the axes.
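
The layout of the prompt can be expressed as a small template function. This is an illustrative JavaScript sketch rather than the Python code used in the project. The names section and buildOneShotPrompt, and the shape of their arguments, are assumptions.

```javascript
// Formats one labelled section of the prompt.
function section(label, text) {
  return "### " + label + "\n" + text + "\n";
}

// Hypothetical helper: a worked example (instruction, data, response) followed
// by the real request, ending with an empty RESPONSE section for the LLM to
// complete.
function buildOneShotPrompt(example, request) {
  return [
    section("INSTRUCTION", example.instruction),
    section("DATA", example.data),
    section("RESPONSE", example.response),
    section("INSTRUCTION", request.instruction),
    section("DATA", request.data),
    "### RESPONSE\n",
  ].join("\n");
}

const prompt = buildOneShotPrompt(
  {
    instruction: "Write JavaScript code ... showing the number of pupils in each class.",
    data: "Data is in a csv file at 'data-sources/classes.csv'.",
    response: "// example D3 code goes here",
  },
  {
    instruction: "Write JavaScript code ... races won per driver in the 2022 F1 season.",
    data: "Data is in a csv file at 'data-sources/2022_f1.csv'.",
  }
);
console.log(prompt);
```

Keeping the example fixed and substituting only the final instruction and data sections means that the same execution-environment guidance is given with every request.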

Code comments

An earlier version of our one-shot prompt had an example response without any comments before the call to xScale.domain(). With that prompt, the generated code was often failing to transform or aggregate the data correctly. These failures occurred more often than they had in prompts without an example response. It seemed that the LLMs were not able to focus on the user request as well in the one-shot prompt. Results improved when a code comment containing the user request was added at the point in the example response where the data might need to be reshaped, as seen in the example above.

text-davinci-003

Five out of ten attempts using text-davinci-003 with our one-shot prompt produced correct bar charts, complete with labels explaining the meaning of the axes. The chart produced for these five attempts is essentially the same every time and looks like this:

[Image: bar chart with the x-axis labeled "Driver" and the y-axis labeled "Number of Wins", containing five black bars labeled with driver names, and with clear labels for values on the y-axis]

Four attempts produced charts that show the number of podium finishes (first, second and third place finishes) instead of just wins, and one attempt produced a chart with no bars because it fails to count the number of wins for each driver.

Many of the problems seen with the earlier prompts are resolved with our one-shot prompt, but failures due to incorrect data handling are more common. This was true with gpt-3.5-turbo and gpt-4 as well.

gpt-3.5-turbo

  • 2 correct charts (with labels explaining axes)
  • 8 with incorrect data
    • 3 showing podium finishes instead of wins
    • 4 not counting wins per driver (no bars displayed)
    • 1 showing 1 win for any driver with a win (no matter how many races the driver won)
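
The difference between the requested chart and two of these failure modes comes down to which fields are counted and how. The sketch below uses the three sample rows from the one-shot prompt. The function name tally is hypothetical, and the code was written to illustrate the failure modes rather than copied from any model's output.

```javascript
const rows = [
  { "1st Place Driver": "Charles Leclerc", "2nd Place Driver": "Carlos Sainz", "3rd Place Driver": "Lewis Hamilton" },
  { "1st Place Driver": "Max Verstappen", "2nd Place Driver": "Charles Leclerc", "3rd Place Driver": "Carlos Sainz" },
  { "1st Place Driver": "Charles Leclerc", "2nd Place Driver": "Sergio Perez", "3rd Place Driver": "George Russell" },
];

// Counts how many times each driver appears in any of the given fields.
function tally(rows, fields) {
  const counts = {};
  for (const row of rows) {
    for (const field of fields) {
      const driver = row[field];
      counts[driver] = (counts[driver] || 0) + 1;
    }
  }
  return counts;
}

// Requested: wins only.
const wins = tally(rows, ["1st Place Driver"]);

// Failure mode: podium finishes, which counts all three placing fields.
const podiums = tally(rows, ["1st Place Driver", "2nd Place Driver", "3rd Place Driver"]);

// Failure mode: every winner is given the value 1, however many races they won.
const winnerFlags = Object.fromEntries(Object.keys(wins).map((driver) => [driver, 1]));

console.log(wins, podiums, winnerFlags);
```

For these rows Charles Leclerc has 2 wins, 3 podium finishes and a flag value of 1, so the three charts would show visibly different bars for the same driver.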

gpt-4

  • 8 correct charts (with labels explaining axes)
  • 2 runtime errors - trying to use d3.nest() which requires an earlier version of D3

When D3 v5 is loaded instead of D3 v7, both instances of code using d3.nest() produce a correct chart.
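
Version-dependent calls like these could in principle be detected before the generated code is run. The following is a hypothetical sketch of such a check: the name findLegacyD3Calls and its two patterns are assumptions, and the patterns cover only the two problems reported in this post. A regular expression is a blunt tool here and can give false positives. For example, D3 v7 accepts a row-conversion function as the second argument of d3.csv, which the second pattern would also flag.

```javascript
// Patterns for the two version-specific calls seen in our experiments.
const LEGACY_PATTERNS = [
  {
    pattern: /\bd3\.nest\s*\(/,
    note: "d3.nest() is not available in D3 v7",
  },
  {
    pattern: /\bd3\.csv\s*\([^()]*,\s*function\b/,
    note: "d3.csv() called with a function argument; may be the callback form used before D3 v5",
  },
];

// Hypothetical helper: returns a note for every pattern found in the code.
function findLegacyD3Calls(code) {
  return LEGACY_PATTERNS
    .filter((entry) => entry.pattern.test(code))
    .map((entry) => entry.note);
}

console.log(findLegacyD3Calls('d3.csv("2022F1SeasonData.csv", function(data) {'));
console.log(findLegacyD3Calls("d3.csv('data-sources/2022_f1.csv').then(function(data) {"));
```

The first call reports one note and the second reports none, because the promise-based form passes only the file path to d3.csv.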

Can AI write software?

Our experiments showed that LLMs can generate code to solve a moderately complex problem, but that getting decent results requires giving the LLM a lot of help to understand the target execution context. We were able to give that help using a one-shot prompting technique, but even with our help none of the LLMs produced correct code more than 80% of the time.

LLMs and more specialised machine learning models will undoubtedly continue to improve in their ability to generate correct code. I expect a steady increase in the number of scenarios where AI can be used to automatically generate code in the real world, but I also expect that for a long time to come developing robust software with AI code contributions will require a lot of human engineering as well.

Acknowledgements

This blog post is based on the work of the entire project team at Scott Logic so thanks are due to Ben Pearson, Jodie Laurenson, Husain Mehdi, Andrew Brown and Steve Conway.

Thanks also to Doğan Yiğit Yenigün for collecting and publishing F1 race data used in our work.


  1. Work on this project was completed in October 2023, and we only have results from LLMs available to us at that time.