Running Python micro-benchmarks using the ChatGPT Code Interpreter alpha


2023-04-11 21:20:49


Today I wanted to understand the performance difference between two Python implementations of a mechanism for detecting changes to a SQLite database schema. I rendered the difference between the two as this chart:

Benchmark Results Over Time - two lines, one for PRAGMA schema_version and one for hashlib.md5. The axes are time in seconds and number of tables - the hashlib.md5 line climbs from around 0.000075 to 0.000150 after 200 tables have been created, while the PRAGMA schema_version line stays almost flat at 0.000050 at the bottom of the chart. It's a good chart: well labelled, easy to read.

From start to finish, this entire benchmarking exercise took me less than five minutes, because ChatGPT did almost all of the work for me.

ChatGPT Code Interpreter alpha

I ran the benchmark using the new ChatGPT "Code Interpreter" alpha, which I recently gained access to, presumably as a result of being in the alpha for ChatGPT Plugins.

Code Interpreter mode provides ChatGPT with a single additional tool: it can now generate Python code and execute it in a restricted sandbox. The sandbox has no network access and only has access to a set of pre-approved libraries. One of those libraries is matplotlib.pyplot, and it knows how to both render and display charts using it.

Here's the initial prompt I sent it:

In Python create an in-memory SQLite database with 100 tables each with 10 columns.

Time how long it takes to execute PRAGMA schema_version against that database 100 times.

Then add another 100 tables and time PRAGMA schema_version 100 times again.

Now run the whole benchmark a second time, but instead of PRAGMA schema_version time how long it takes to run hashlib.md5(db.execute("select group_concat(sql) from sqlite_master").fetchall()[0]).hexdigest() instead

Background: why compare these two things?

A bit of background on this. Datasette needs to detect any time a SQLite schema changes, which could happen if a table is added or altered in some way. It needs to know so it can update its own in-memory cache of the schema, in order to (eventually) power fast table search across all known databases.

SQLite has a neat mechanism for doing this: you can run PRAGMA schema_version and it will return a number that increments every time the schema changes. This is a very fast operation.
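
Here's a minimal sketch of that mechanism (my own illustration, not part of the ChatGPT transcript) showing the counter incrementing after a schema change:

import sqlite3

db = sqlite3.connect(":memory:")

# Read the current schema version number
version_before = db.execute("PRAGMA schema_version").fetchone()[0]

# Any schema change (CREATE, ALTER, DROP) increments that number
db.execute("CREATE TABLE example (id INTEGER)")
version_after = db.execute("PRAGMA schema_version").fetchone()[0]

print(version_before, version_after)  # the second value is higher than the first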

I'm currently investigating a confounding bug where a user has reported the error message "attempt to write a readonly database" when this PRAGMA schema_version read operation is executed.

I'm still investigating, but this could mean that I need to find an alternative way of detecting changes to the schema.

One option is to calculate an MD5 hash of the full schema of the database, using the following:

schema = db.execute(
    "select group_concat(sql) from sqlite_master"
).fetchall()[0][0]
hash = hashlib.md5(schema.encode("utf-8")).hexdigest()

I can then compare that hash to the previous hash to see if the schema has changed.
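
Something along these lines would work for that comparison (again my own sketch, not code from the transcript; the schema_hash() helper and previous_hash variable are hypothetical names):

import hashlib
import sqlite3

def schema_hash(db):
    # Hash the concatenated CREATE statements for everything in sqlite_master
    schema = db.execute(
        "select group_concat(sql) from sqlite_master"
    ).fetchall()[0][0] or ""
    return hashlib.md5(schema.encode("utf-8")).hexdigest()

db = sqlite3.connect(":memory:")
previous_hash = schema_hash(db)

db.execute("CREATE TABLE example (id INTEGER)")
if schema_hash(db) != previous_hash:
    # The schema changed - time to refresh the in-memory cache
    previous_hash = schema_hash(db)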

The question I wanted to answer was how much of a performance hit this would represent, especially for databases with a large number of tables.

Hence my desire to run a quick micro-benchmark.

ChatGPT's initial benchmark

ChatGPT started by writing the following:

import sqlite3
import time
import hashlib

# Define a function to create tables and run benchmarks
def run_benchmark(db, num_tables, num_columns, benchmark_query):
    # Start timing
    start_time = time.time()
    
    # Create tables
    for i in range(num_tables):
        columns = ', '.join([f"col{j} INTEGER" for j in range(num_columns)])
        db.execute(f"CREATE TABLE table{i} ({columns})")
    
    # Benchmark
    for i in range(100):
        if benchmark_query == "PRAGMA schema_version":
            db.execute("PRAGMA schema_version")
        else:
            hashlib.md5(db.execute("SELECT group_concat(sql) FROM sqlite_master").fetchall()[0][0].encode()).hexdigest()
    
    # Calculate time taken
    elapsed_time = time.time() - start_time
    return elapsed_time

# Initialize in-memory SQLite database
db = sqlite3.connect(":memory:")

# Run the first benchmark
benchmark1_time1 = run_benchmark(db, num_tables=100, num_columns=10, benchmark_query="PRAGMA schema_version")
benchmark1_time2 = run_benchmark(db, num_tables=100, num_columns=10, benchmark_query="PRAGMA schema_version")

# Run the second benchmark
db2 = sqlite3.connect(":memory:")
benchmark2_time1 = run_benchmark(db2, num_tables=100, num_columns=10, benchmark_query="hashlib.md5")
benchmark2_time2 = run_benchmark(db2, num_tables=100, num_columns=10, benchmark_query="hashlib.md5")

benchmark1_time1, benchmark1_time2, benchmark2_time1, benchmark2_time2

This is a pretty solid attempt at a benchmark! It has good comments, abstracts the two benchmarks into a single function and runs that function twice to test against a total of 200 tables.

Then it ran the code… and this happened:

ChatGPT screenshot. Finished working... some Python code, and then a Traceback showing an error: table table0 already exists. ChatGPT then says: It looks like we are trying to create tables with the same names as the ones that already exist in the database. This is causing an OperationalError. To resolve this, we need to change the table names in the second set of tables to avoid conflicts. Additionally, I will update the code to use the benchmark_query parameter in a more intuitive way. We'll pass the actual query or function to be benchmarked to the run_benchmark function. Let me fix this issue and re-run the code.

It ran the code it generated… and got an OperationalError: table table0 already exists error.

That's because it ran that run_benchmark() function twice against the same database, and the function tried to re-create a table it had created the first time.

But it didn't just stop at the error… it output an explanation of what it thought had gone wrong, and then, without any input from me, wrote a second version of the benchmark script:

# Define a function to create tables and run benchmarks
def run_benchmark(db, num_tables, num_columns, benchmark_query, initial_table_index=0):
    # Start timing
    start_time = time.time()
    
    # Create tables
    for i in range(initial_table_index, initial_table_index + num_tables):
        columns = ', '.join([f"col{j} INTEGER" for j in range(num_columns)])
        db.execute(f"CREATE TABLE table{i} ({columns})")
    
    # Benchmark
    for i in range(100):
        if benchmark_query == "PRAGMA schema_version":
            db.execute("PRAGMA schema_version")
        else:
            hashlib.md5(db.execute("SELECT group_concat(sql) FROM sqlite_master").fetchall()[0][0].encode()).hexdigest()
    
    # Calculate time taken
    elapsed_time = time.time() - start_time
    return elapsed_time

# Initialize in-memory SQLite databases
db1 = sqlite3.connect(":memory:")
db2 = sqlite3.connect(":memory:")

# Run the first benchmark with PRAGMA schema_version
benchmark1_time1 = run_benchmark(db1, num_tables=100, num_columns=10, benchmark_query="PRAGMA schema_version")
benchmark1_time2 = run_benchmark(db1, num_tables=100, num_columns=10, benchmark_query="PRAGMA schema_version", initial_table_index=100)

# Run the second benchmark with hashlib.md5
benchmark2_time1 = run_benchmark(db2, num_tables=100, num_columns=10, benchmark_query="hashlib.md5")
benchmark2_time2 = run_benchmark(db2, num_tables=100, num_columns=10, benchmark_query="hashlib.md5", initial_table_index=100)

benchmark1_time1, benchmark1_time2, benchmark2_time1, benchmark2_time2

And this time it worked!

RESULT: (0.008260965347290039, 0.006290912628173828, 0.011242151260375977, 0.014682292938232422) - Here are the results of the benchmarks: For the first benchmark using PRAGMA schema_version: The time taken to create 100 tables and run PRAGMA schema_version 100 times: approximately 0.0083 seconds. The time taken to create another 100 tables (totaling 200 tables) and run PRAGMA schema_version 100 times: approximately 0.0063 seconds. For the second benchmark using hashlib.md5: The time taken to create 100 tables and run hashlib.md5(db.execute("SELECT group_concat(sql) FROM sqlite_master").fetchall()[0]).hexdigest() 100 times: approximately 0.0112 seconds. The time taken to create another 100 tables (totaling 200 tables) and run hashlib.md5(db.execute("SELECT group_concat(sql) FROM sqlite_master").fetchall()[0]).hexdigest() 100 times: approximately 0.0147 seconds. Please note that the time taken for these benchmarks may vary slightly with each run due to factors such as system performance and other processes running on the system.

Adding a chart

This is a very solid answer to my question. I checked it through and it mostly looks good. I would have written more-or-less the same thing, but it would have taken me a lot longer to do it.

So I asked it for a plot:

Plot these benchmark results over time on a chart

It wrote more code. ChatGPT writes and executes code in the same session, so it had access to the variables it had previously populated:

import matplotlib.pyplot as plt

# Data for benchmark results
benchmark1_times = [benchmark1_time1, benchmark1_time2]
benchmark2_times = [benchmark2_time1, benchmark2_time2]

# x-axis representing the number of tables at each benchmark
num_tables = [100, 200]

# Plotting the benchmark results
plt.plot(num_tables, benchmark1_times, label='PRAGMA schema_version', marker='o')
plt.plot(num_tables, benchmark2_times, label='hashlib.md5', marker='o')

# Labels and title
plt.xlabel('Number of Tables')
plt.ylabel('Time (seconds)')
plt.title('Benchmark Results Over Time')
plt.legend()

# Show the plot
plt.show()

I'm not very good at using the matplotlib charting libraries. I have to painstakingly piece everything together every time I use them. ChatGPT just did it.

Here's the chart it drew me:

Benchmark Results Over Time - two lines, one for PRAGMA schema_version and one for hashlib.md5. There are only two points on the chart - at 100 tables and at 200 tables - with straight lines between them.

This chart looks pretty good at first glance: it has an appropriate title, axis labels and a legend.

In terms of information density though it's garbage. Charting just two points for each of the implementations, at 100 tables and 200 tables, is hardly worth visualizing at all.

So I fed it a final prompt:

Run the benchmark again in order to draw a better chart with measurements for every point between 1 and 200 tables

And that resulted in the chart I showed at the top of this post:

Benchmark Results Over Time - two lines, one for PRAGMA schema_version and one for hashlib.md5. The axes are time in seconds and number of tables - the hashlib.md5 line climbs from around 0.000075 to 0.000150 after 200 tables have been created, while the PRAGMA schema_version line stays almost flat at 0.000050 at the bottom of the chart. It's a good chart: well labelled, easy to read.

Here's the code it wrote:

# Define a function to create tables and run a single benchmark iteration
def run_single_benchmark(db, num_tables, num_columns, benchmark_query, initial_table_index=0):
    # Start timing
    start_time = time.time()
    
    # Create tables
    for i in range(initial_table_index, initial_table_index + num_tables):
        columns = ', '.join([f"col{j} INTEGER" for j in range(num_columns)])
        db.execute(f"CREATE TABLE table{i} ({columns})")
    
    # Benchmark
    if benchmark_query == "PRAGMA schema_version":
        db.execute("PRAGMA schema_version")
    else:
        hashlib.md5(db.execute("SELECT group_concat(sql) FROM sqlite_master").fetchall()[0][0].encode()).hexdigest()
    
    # Calculate time taken
    elapsed_time = time.time() - start_time
    return elapsed_time

# Initialize in-memory SQLite databases
db1 = sqlite3.connect(":memory:")
db2 = sqlite3.connect(":memory:")

# Lists to store benchmark results
benchmark1_times = []
benchmark2_times = []

# Run the benchmarks for every point between 1 and 200 tables
for num in range(1, 201):
    # Run the benchmark with PRAGMA schema_version
    benchmark1_time = run_single_benchmark(db1, num_tables=1, num_columns=10, benchmark_query="PRAGMA schema_version", initial_table_index=num-1)
    benchmark1_times.append(benchmark1_time)
    
    # Run the benchmark with hashlib.md5
    benchmark2_time = run_single_benchmark(db2, num_tables=1, num_columns=10, benchmark_query="hashlib.md5", initial_table_index=num-1)
    benchmark2_times.append(benchmark2_time)

# x-axis representing the number of tables at each benchmark
num_tables = list(range(1, 201))

# Plotting the benchmark results
plt.plot(num_tables, benchmark1_times, label='PRAGMA schema_version', marker='.')
plt.plot(num_tables, benchmark2_times, label='hashlib.md5', marker='.')

# Labels and title
plt.xlabel('Number of Tables')
plt.ylabel('Time (seconds)')
plt.title('Benchmark Results Over Time')
plt.legend()

# Show the plot
plt.show()

Full transcript

The whole project was completed using just the three prompts listed above!

I used my Observable notebook to convert the ChatGPT transcript JSON (extracted using the Network pane in the browser tools) to Markdown, and pasted it into this Gist: https://gist.github.com/simonw/2fc46effbfbe49e6de0bcfdc9e31b235

(I like using "secret" Gists to share this kind of thing because they are automatically marked by GitHub as <meta name="robots" content="noindex">, which should hopefully prevent them from polluting the web with LLM-generated content. Though I've pasted a LOT of ChatGPT-written code into this search-indexed blog entry in this particular case.)

A very weird kind of intern

Here's another analogy for large language models (since we can never have too many of these). It honestly feels a little like having a coding intern with a strange set of characteristics:

  • They've read and memorized all of the world's public coding documentation, although they stopped reading in September 2021.
  • They're unbelievably fast: they take just seconds to output entire programs from scratch.
  • If their code produces an error they can read the error message and attempt to update their code to fix it.
  • They won't argue with you. That's not necessarily a good thing: if you ask them to build something that's clearly a bad idea they'll likely just go ahead and do it anyway.
  • They respond extremely well to feedback, and they'll apply it within seconds.
  • You have to know how to prompt them. I've gotten pretty good at this now, but it takes a lot of trial and error, and I think it requires a fairly deep understanding of how they work.
  • You have to closely review EVERYTHING they do.
  • They work for free.

And of course, they have zero understanding of anything at all. They're a next-token-predicting machine with an unimaginably large training set.

The fact that they can do even a fraction of the things they can do is, quite frankly, unbelievable. I'm still not sure I believe it myself.

Something I missed

After publishing this post, I realized that I'd missed a mistake ChatGPT had made.

I wanted to compare the time taken to execute PRAGMA schema_version vs. calculating the MD5 hash of select group_concat(sql) from sqlite_master. But… ChatGPT had started the timer with start_time = time.time() before creating the tables, so the time measurement included the table creation time.

This didn't affect the comparison between the two, but it did mean that I wasn't getting the underlying numbers that I most cared about.

Like I said earlier, you have to closely review EVERYTHING they do. I'm embarrassed I missed this!

I fixed it with one final prompt:

Record the start time after the # Benchmark comment, then re-run and re-plot

It rewrote that block of code to look like this:

    # Benchmark
    # Start timing
    start_time = time.time()
    if benchmark_query == "PRAGMA schema_version":
        db.execute("PRAGMA schema_version")
    else:
        hashlib.md5(db.execute(
            "SELECT group_concat(sql) FROM sqlite_master"
        ).fetchall()[0][0].encode()).hexdigest()

    # Calculate time taken
    elapsed_time = time.time() - start_time
    return elapsed_time

And generated the following chart:

The same chart again, but this time the PRAGMA line sticks right at 0.0000 seconds and the other line climbs up to around 0.00005 seconds
