Analysing CSV files with Langchain (ChatGPT)

Hi all,

Can we get OpenAI to answer our questions based on a csv input?

We are back with another coding snippet this week. This time we are focussing on LangChain and how we can auto-generate answers using the capabilities of OpenAI.

The above printscreen isn’t the prettiest end output, but is what we will be striving towards today and I am very excited to showcase it.

This tutorial shows how we can use the OpenAI package and LangChain to load a CSV file and ask questions about it, with an agent sending back the answers.

Today we will look at LLMs.

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app – the real power comes when you can combine them with other sources of computation or knowledge. Read more about it here

Firstly, let’s look at how we can get set up.

We will need

A Python IDE (I choose to use PyCharm)
An OpenAI account and an API key generated.
A dataset to analyse (I chose tennis)

As a brief refresher, here is where you can access your API token. It is stored under User, where you can create a secret key.

You can see how much of your free credit you have used under the Usage section.

By now, you will know my IDE tool of choice tends to be PyCharm.

Copy and paste the code snippet into your own main.py file.

Make sure to add your API key to the code.
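If you would rather not paste the key straight into main.py, one option is to read it from an environment variable. Here is a minimal sketch, assuming you have exported the key as OPENAI_API_KEY before running the script; the load_api_key helper is my own name for illustration, not part of either package.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    load_api_key is a hypothetical helper for this tutorial.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or run "
            "configuration before starting the script."
        )
    return key


# Usage, once the variable is set:
# api_key = load_api_key()
```

This keeps the secret out of the file itself, which helps if you ever share or commit main.py.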

Download the ATP Rankings dataset from the repo. I chose this dataset from a previous tutorial I wrote. I felt it was a nice, easy dataset to interpret, showing the rankings of the top 500 men's players back around November.

For this to work you will need to be on Python 3.9. I was previously on 3.8, which was not compatible and threw errors.

You will also need to pip install your packages. For this we need openai, langchain and pandas. The os module ships with Python, so it does not need installing.
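Before running anything, a quick sanity check can save some head-scratching. This is a small sketch using only the standard library; the 3.9 minimum comes from my own experience above, and the helper names are made up for this post.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # the version that worked for me; treat as an assumption
REQUIRED_PACKAGES = ("openai", "langchain", "pandas")  # os is built in


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python OK:", python_is_supported())
print("Missing packages:", missing_packages())
```

If the second line prints an empty list, everything the script imports is available to the interpreter you ran it with.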

You’ll know this is done when nothing in the code is underlined in red anymore.

Sometimes installing from the terminal doesn’t fully load my packages, so I like to go into the interpreter settings and add them manually there.

Once that’s all done we are ready to run the code. But before we get into the responses, let’s look at why we are doing this.

What are agents?

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.
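To make the Thought, Action, Observation loop a bit more concrete, here is a toy sketch in plain Python. It is not how LangChain implements agents: a hard-coded scripted_policy function stands in for the LLM, and every name in it (run_agent, scripted_policy, count_rows, the three sample rows) is made up for illustration.

```python
# Made-up data standing in for a loaded CSV file.
ROWS = [{"player": "A"}, {"player": "B"}, {"player": "C"}]

# Tools the "agent" is allowed to call, keyed by name.
TOOLS = {"count_rows": lambda _input: str(len(ROWS))}


def scripted_policy(question, observations):
    """Stand-in for the LLM: decide the next step from what has been seen.

    Returns (thought, action, action_input). An action of None means the
    policy is finished and action_input holds the final answer.
    """
    if not observations:
        return ("I should count the rows.", "count_rows", "")
    return (
        "I now know the final answer.",
        None,
        f"There are {observations[-1]} rows.",
    )


def run_agent(question, policy, tools, max_steps=5):
    """Repeat decide -> act -> observe until the policy gives an answer."""
    observations = []
    for _ in range(max_steps):
        thought, action, action_input = policy(question, observations)
        print(f"Thought: {thought}")
        if action is None:
            print(f"Final Answer: {action_input}")
            return action_input
        print(f"Action: {action}")
        observation = tools[action](action_input)
        print(f"Observation: {observation}")
        observations.append(observation)
    raise RuntimeError("No final answer within max_steps")


answer = run_agent("How many rows are there?", scripted_policy, TOOLS)
```

In a real agent the policy step is a call to the LLM, which is why the same loop can handle questions nobody scripted in advance.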

If you’d like to learn more about Langchain you can read about it here.

In our example we want to execute three simple questions against our csv.

Firstly, how many rows are there?

Secondly, how many individuals have the first name Pedro? What are their full names?

Thirdly, how many more points do the top 10 players currently have compared to the bottom 10 players?
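For reference, here is a rough sketch of how these three questions could be wired up to a CSV agent. Treat it as an assumption rather than the exact snippet from the repo: it uses the create_csv_agent helper and OpenAI wrapper as they were exposed in early 2023 releases of LangChain (later releases moved the CSV agent to a separate experimental package), and the filename in the usage comment is hypothetical.

```python
QUESTIONS = [
    "How many rows are there?",
    "How many individuals have the first name Pedro? What are their full names?",
    "How many more points do the top 10 players have compared to the bottom 10 players?",
]


def build_csv_agent(csv_path, verbose=True):
    """Create a CSV agent (assumes an early-2023 LangChain release).

    The imports live inside the function so the rest of this file still
    loads on a machine where LangChain is not installed.
    """
    from langchain.agents import create_csv_agent
    from langchain.llms import OpenAI

    llm = OpenAI(temperature=0)  # reads OPENAI_API_KEY from the environment
    return create_csv_agent(llm, csv_path, verbose=verbose)


def ask_all(agent, questions=QUESTIONS):
    """Send each question to the agent and collect the answers."""
    return {question: agent.run(question) for question in questions}


# Usage (needs an API key, network access and some API credit):
# agent = build_csv_agent("atp_rankings.csv")  # hypothetical filename
# for question, answer in ask_all(agent).items():
#     print(question, "->", answer)
```

With verbose=True the agent prints its thoughts, actions and observations as it goes, which is the output we walk through next.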

Let’s look at this in action.

When we run our command, it translates our question into a thought. It then takes that thought and creates an action, almost as if you had asked an analyst how they would answer it in code. It then computes that action and gives us an observation. Of course, it is no surprise that our top 500 ranking of men’s tennis has 500 records in it.

It then has a final thought that it knows the answer, and states it to us as an answer.

Nice and simple!

Let’s go on to do something a little more challenging and find individuals called Pedro.

Now, if I ask the agent how many people in the dataset are called Pedro, it gives us zero.

So we have to be a little bit more precise.

Let’s ask it how many individuals have the FIRST name Pedro.

Well of course now it understands. It tells us the code it would use to compute the answer, it prints all those in the dataset called Pedro and gives us a written response too.
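One plausible reason the vaguer question came back as zero is that it was treated as an exact match on the whole name. That is a guess on my part, since the agent's generated code isn't reproduced here. The sketch below shows the difference in plain Python, using made-up rows and assumed column names rather than the real ATP file.

```python
import csv
import io

# Made-up sample rows; the real file's columns and players will differ.
SAMPLE = """rank,player,points
1,Alex Sample,9000
2,Pedro Example,8000
3,Pedro Placeholder,7000
4,Juan Pedro Demo,6000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# "Called Pedro" read literally: the whole name must equal "Pedro".
exact_matches = [r["player"] for r in rows if r["player"] == "Pedro"]

# "First name Pedro": only the first word of the name must match.
first_name_matches = [
    r["player"] for r in rows if r["player"].split()[0] == "Pedro"
]

print(len(exact_matches))
print(first_name_matches)
```

The more precise wording leaves the agent less room to pick the wrong comparison.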

Pretty impressive.

So, on to our final question: a mathematical computation. I ask it to compare the top 10 ranked individuals against the bottom 10 and work out the difference in points.

And there we are: the correct value for the difference between them.
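If you want to double-check the agent's arithmetic by hand, the calculation can be sketched as below. I am assuming "difference" means the total points of the top 10 minus the total points of the bottom 10; the top_vs_bottom helper and the sample numbers are made up for illustration.

```python
def top_vs_bottom(points, n=10):
    """Total points of the n highest values minus the n lowest values."""
    ranked = sorted(points, reverse=True)
    if len(ranked) < 2 * n:
        raise ValueError("Need at least 2 * n values to compare both ends")
    return sum(ranked[:n]) - sum(ranked[-n:])


# Made-up points for 30 players: 1, 2, ..., 30.
sample_points = list(range(1, 31))
difference = top_vs_bottom(sample_points)
print(difference)
```

Running the same function over the points column of your own file gives you a number to compare against the agent's final answer.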

Today we have gone over some basic examples of calculations against a CSV file, but this is really only scratching the surface of what is possible.

Going further:

Try looking at other types of files you can analyse, including PDFs.
Load your own CSV file in and ask it questions.
Try debugging which types of questions it can’t respond to.
Try creating a log of questions and responses.
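As a starting point for that last idea, here is a minimal sketch that appends each question and answer to a JSON Lines file. The function names and the file layout are my own choices, not anything provided by LangChain.

```python
import json
import time


def log_exchange(path, question, answer):
    """Append one question/answer pair to a JSON Lines file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "question": question,
        "answer": answer,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_log(path):
    """Load every logged exchange back as a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling log_exchange after each agent response builds up a file you can load back later to spot the questions it struggles with.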

That’s it for this week, catch you soon.

LOGGING OFF


Discover more from CJ Mayes