localmodels

module v0.0.0-...-da23216
Published: Dec 8, 2023 License: MIT

README

Testdriving OLLAMA

2023-11-21, Leipzig Gophers Meetup #38, Martin Czygan, L Gopher, Open Data Engineer at IA

Short talk about running local models, using Go tools.

Personal Timeline

"What a difference a week makes"

I am going to assert that Riley is the first Staff Prompt Engineer hired anywhere.

  • on 2023-02-14 (+9w), I ask at the Leipzig Python User Group how long it will be before we can run things like this locally -- personally, I expected a 2-5 year timeline
  • on 2023-04-18 (+9w), we discuss C/GO and ggml (ai-on-the-edge) at Leipzig Gophers #35
  • on 2023-07-20 (+13w), ollama is released (with two models), HN
  • on 2023-11-21 (+17w), today, 43 models (each with a couple of tags/versions)

Confusion

The Turing Test was proposed in 1950. From Nature, 2023-07-23: Understanding ChatGPT is a bold new challenge for science

This lack of robustness signals a lack of reliability in the real world.

What I cannot create, I do not understand.

Whether a model is "open" is not a binary question:

We propose a framework to assess six levels of access to generative AI systems, from The Gradient of Generative AI Release: Methods and Considerations:

  • fully closed
  • gradual or staged access
  • hosted access
  • cloud-based or API access
  • downloadable access
  • fully open

A prolific AI researcher (with 387K citations in the past 5 years) believes open-source AI is OK for less capable models: Open-Source vs. Closed-Source AI

For today, let's focus on Go. Go is a nice infra language; what projects exist for model infra?

  • we are going to look at one tool, from the outside and a bit from the inside

POLL

OLLAMA

  • first appeared in 07/2023 (~18 weeks ago)
  • heavily inspired by docker, but shipping models instead of images
  • built on llama (meta) and the GGML ai-on-the-edge ecosystem, especially GGUF - a unified model file format
  • docker may be considered less a glorified nsenter and more (lots of) glue to get from spec to image to process, i.e. code lifecycle management; similarly, ollama may be a way to organize the AI "model lifecycle"
  • clean developer UX

Time-to-chat

From zero to chat in about 5 minutes, on a power-efficient CPU. Ollama started with 2 models; as of 11/2023, it hosts 43 models.

$ git clone git@github.com:jmorganca/ollama.git
$ cd ollama
$ go generate ./... && go build . # cp ollama ...

Follows a client-server model, like docker.

$ ollama serve

Once it is running, we can pull models.

$ ollama pull llama2
pulling manifest
pulling 22f7f8ef5f4c... 100% |..
pulling 8c17c2ebb0ea... 100% |..
pulling 7c23fb36d801... 100% |..
pulling 2e0493f67d0c... 100% |..
pulling 2759286baa87... 100% |..
pulling 5407e3188df9... 100% |..
verifying sha256 digest
writing manifest
removing any unused layers
success

Some examples

$ ollama run zephyr
>>> please complete: {"author": "Turing, Alan", "title" ... }

{
  "author": "Alan Turing",
  "title": "On Computable Numbers, With an Application to the Entscheidungsproblem",
  "publication_date": "1936-07-15",
  "journal": "Proceedings of the London Mathematical Society. Series 2",
  "volume": "42",
  "pages": "230–265"
}

Formatting mine.

More

The whole prompt engineering thing is kind of mysterious to me. Do you get better output by showing emotions?

To this end, we first conduct automatic experiments on 45 tasks using various LLMs, including Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4.

Batch Mode

[GIN-debug] POST   /api/pull       --> gith...m/jmo...ma/server.PullModelHandler (5 handlers)
[GIN-debug] POST   /api/generate   --> gith...m/jmo...ma/server.GenerateHandler (5 handlers)
[GIN-debug] POST   /api/embeddings --> gith...m/jmo...ma/server.EmbeddingHandler (5 handlers)
[GIN-debug] POST   /api/create     --> gith...m/jmo...ma/server.CreateModelHandler (5 handlers)
[GIN-debug] POST   /api/push       --> gith...m/jmo...ma/server.PushModelHandler (5 handlers)
[GIN-debug] POST   /api/copy       --> gith...m/jmo...ma/server.CopyModelHandler (5 handlers)
[GIN-debug] DELETE /api/delete     --> gith...m/jmo...ma/server.DeleteModelHandler (5 handlers)
[GIN-debug] POST   /api/show       --> gith...m/jmo...ma/server.ShowModelHandler (5 handlers)
[GIN-debug] GET    /               --> gith...m/jmo...ma/server.Serve.func2 (5 handlers)
[GIN-debug] GET    /api/tags       --> gith...m/jmo...ma/server.ListModelsHandler (5 handlers)
[GIN-debug] HEAD   /               --> gith...m/jmo...ma/server.Serve.func2 (5 handlers)
[GIN-debug] HEAD   /api/tags       --> gith...m/jmo...ma/server.ListModelsHandler (5 handlers)

Specifically /api/generate.
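
One way to call it from Go - a minimal, hypothetical client sketch (not code from this module; it assumes an ollama server on the default localhost:11434 and uses placeholder model and prompt values). By default the response is streamed as newline-delimited JSON, one small chunk per generated piece of text:

// Hypothetical sketch: stream a completion from a local ollama server.
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

type generateResponse struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

func main() {
	// Placeholder model and prompt.
	body, _ := json.Marshal(generateRequest{
		Model:  "llama2",
		Prompt: "Why is the sky blue? Answer in one sentence.",
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// The server streams newline-delimited JSON chunks; print them as they arrive.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var chunk generateResponse
		if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
			panic(err)
		}
		fmt.Print(chunk.Response)
		if chunk.Done {
			fmt.Println()
		}
	}
}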

Constraints

  • possible to enforce JSON generation (see the sketch below)
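
In recent ollama versions this is the optional format field on the /api/generate request. A hedged sketch, extending the request struct from the client sketch above (the prompt should still spell out the desired schema):

type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Format string `json:"format,omitempty"` // set to "json" to constrain output to valid JSON
}
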
Customizing models

weights, configuration, and data in a single package

Using a Modelfile.

FROM llama2
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096

# sets a custom system prompt to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.

Freeze this as a custom package:

$ ollama create llama-mario -f custom/Modelfile.mario
$ ollama run llama-mario

About 16 parameters to tweak: Valid Parameters and Values
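
A hedged sample of what a few of them look like as PARAMETER lines in a Modelfile (parameter names as documented upstream; the values here are only illustrative):

PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 128
PARAMETER stop "User:"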

Task 1: "haiku"

  • generate a small volume of Go programming haiku (see the sketch below)

// haikugen generates
// JSON output for later eval
// cannot parallelize
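
A guess at the shape of such a task, as a hypothetical sketch (prompt, model name, and output fields are assumptions, not the actual haikugen code); requests go out one after another, since a single ollama instance answers sequentially:

// Hypothetical haikugen-style task: ask a local model for a few Go
// programming haiku and write them out as JSON lines for later evaluation.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type request struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"` // false: return a single response object instead of a stream
}

type response struct {
	Response string `json:"response"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// Sequential on purpose: a single ollama instance answers one request at a time.
	for i := 0; i < 10; i++ {
		body, _ := json.Marshal(request{
			Model:  "llama2",
			Prompt: "Write one haiku about programming in Go. Reply with the haiku only.",
			Stream: false,
		})
		resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		var r response
		if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		resp.Body.Close()
		// One JSON document per line, easy to evaluate later.
		enc.Encode(map[string]string{"model": "llama2", "haiku": r.Response})
	}
}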

Task 2: "bibliography"

  • given unstructured strings, parse them to JSON (see the sketch below)
  • unstructured
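
A hedged sketch of the validation side of such a task (field names are assumptions, not the actual code in this module): define the target record as a Go struct and let encoding/json act as the schema check for whatever the model returns, e.g. piped in from an /api/generate call:

// Hypothetical validation step for a bibliography task: read the model's
// JSON reply from stdin and check it against the expected record shape.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Record is the shape we ask the model to produce for each unstructured string.
type Record struct {
	Author  string `json:"author"`
	Title   string `json:"title"`
	Journal string `json:"journal,omitempty"`
	Year    string `json:"year,omitempty"`
	Pages   string `json:"pages,omitempty"`
}

func main() {
	dec := json.NewDecoder(os.Stdin)
	dec.DisallowUnknownFields() // surface any extra keys the model invented
	var rec Record
	if err := dec.Decode(&rec); err != nil {
		fmt.Fprintln(os.Stderr, "model output did not match the schema:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", rec)
}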

Credits

Directories

Path                 Synopsis
tasks
    haiku            haikugen generates JSON output for later eval cannot parallelize
    unstructured     haikugen generates JSON output for later eval cannot parallelize
