
Windows 64-bit, Microsoft Visual Studio - it works like a charm after those fixes! #22

Closed
bsiminski opened this issue Mar 11, 2023 · 40 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed), windows (Issues specific to Windows)

Comments

@bsiminski

First of all, tremendous work Georgi! I managed to run your project with small adjustments on:

  • Intel(R) Core(TM) i7-10700T CPU @ 2.00GHz / 16GB RAM, as a 64-bit app; it takes around 5GB of RAM.

Here is the list of those small fixes:

  • main.cpp: added ggml_time_init() at start of main (division by zero otherwise)
  • quantize.cpp: same as above at start of main (division by zero otherwise)
  • ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)
  • ggml.c: replace fopen with fopen_s (VS secure error message)
  • ggml.c: below changes due to 'expression must be a pointer or complete object type':
  1. 2x (uint8_t*)(y to: ((uint8_t*)y
  2. 4x (const uint8_t*)(x to ((const uint8_t*)x
  3. 2x (const uint8_t*)(y to ((const uint8_t*)y
  • quantize.cpp: removed qk in ggml_quantize_q4_0 & ggml_quantize_q4_1 calls
  • utils.cpp: use of the QK constant instead of the parameter value (VS raises an error for the VLA: uint8_t pp[qk / 2];)

It would be really great if you could incorporate those small fixes.
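
For reference, the first two fixes and the VLA workaround boil down to something like the sketch below; this is not the actual diff, and the open_file wrapper is just illustrative:

// main.cpp / quantize.cpp: call ggml_time_init() before any timing code runs,
// otherwise the first "ms per token" division can hit a zero denominator.
#include <stdio.h>
#include "ggml.h"

// ggml.c: MSVC flags plain fopen as "unsafe"; fopen_s is the variant it wants
static FILE * open_file(const char * fname, const char * mode) {
#ifdef _MSC_VER
    FILE * f = NULL;
    fopen_s(&f, fname, mode);
    return f;
#else
    return fopen(fname, mode);
#endif
}

int main(void) {
    ggml_time_init();

    FILE * f = open_file("models/7B/ggml-model-q4_0.bin", "rb");
    if (f) fclose(f);

    // utils.cpp: MSVC has no VLAs, so `uint8_t pp[qk / 2];` becomes a
    // fixed-size buffer using the compile-time constant, e.g. uint8_t pp[QK / 2];
    return 0;
}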

@etra0
Collaborator

etra0 commented Mar 11, 2023

Interesting, doing these changes (and a couple more hacks) I was able to run the 13B model on my HW (AMD Ryzen 7 3700X 8-Core Processor, 3593 MHz, 8 Core(s), 16 Logical Processor(s), 32 GB RAM) and get 268 ms per token, with around 8 GB of RAM usage!

I forced the usage of AVX2 and that gave a huge speed-up.
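
For anyone curious, forcing AVX2 is basically one compiler flag; treat these as a sketch, since the exact flags depend on your toolchain:

cl    /O2 /arch:AVX2   main.cpp ggml.c utils.cpp
clang -O3 -mavx2 -mfma main.cpp ggml.c utils.cpp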

etra0 added a commit to etra0/llama.cpp that referenced this issue Mar 11, 2023
@bsiminski
Author

@etra0 here are my 13B model tests, based on number of threads & AVX2 (thanks!):

4: 3809.57 ms per token (default settings)
8: 3617.09 ms per token (default settings)
12: 2967.79 ms per token (default settings)

4: 495.08 ms per token (with AVX2)
8: 519.78 ms per token (with AVX2)
12: 490.53 ms per token (with AVX2)

Clearly AVX2 gives a huge boost. I see however that you are still way ahead with your 268 ms. What other optimizations do you have?

@ggerganov
Owner

Yes, AVX2 flags are very important for high performance.
Could you wrap these changes in a PR?

ggml.c: #define QK 32 moved to dedicated define.h (should not be in .c)

This is not very desirable - I don't want an extra file added. Although the QK constants everywhere are indeed problematic.
Some other fix?

@etra0
Collaborator

etra0 commented Mar 11, 2023

Could you wrap these changes in a PR?

I could do that, but I'm unsure whether to create a Solution, or move the project to CMake, because Windows doesn't support Make by default, sadly.

I always try to avoid Solutions because they're not multiplatform, but from looking at the makefile, rewriting it to CMake would take a bit more time. In the meantime I could do a PR to fix the things that won't compile.
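
For what it's worth, a minimal CMakeLists.txt along these lines would probably be enough to start with (a sketch only; the file and target names are taken from this thread and the eventual PR may look different):

cmake_minimum_required(VERSION 3.12)
project(llama.cpp C CXX)

# core ggml code plus the two executables discussed above
add_library(ggml ggml.c utils.cpp)
add_executable(llama main.cpp)
add_executable(quantize quantize.cpp)
target_link_libraries(llama PRIVATE ggml)
target_link_libraries(quantize PRIVATE ggml)

# AVX2 matters a lot for speed; MSVC and clang/gcc spell the flag differently
if (MSVC)
    target_compile_options(ggml PRIVATE /arch:AVX2)
else()
    target_compile_options(ggml PRIVATE -mavx2 -mfma)
endif()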

@ggerganov
Owner

CMake is better than Solutions. The https://github.com/ggerganov/whisper.cpp project has a CMake build system that works on Windows, and the project is very similar. It should be easy to adapt.

@kamyker

kamyker commented Mar 12, 2023

Great! These changes finally fixed compilation for me using the VS cl command (#2) and also CMake with @etra0's repo.

I get 140 ms per token on an i9-9900K and about 5 GB RAM usage with 7B.

Unfortunately, bigger prompts are kind of unusable. Don't know if it's a Windows issue or if this library just isn't optimized for this case yet. Making the hardcoded 512-token limit a parameter was an easy change, but it's still too slow as it repeats all the prompt tokens.

ggerganov added the enhancement, help wanted, and good first issue labels Mar 12, 2023
@ggerganov
Owner

ggerganov commented Mar 12, 2023

@kamyker

Maybe the context size has to be increased - it's currently hardcoded to 512:

if (!llama_model_load(params.model, model, vocab, 512)) { // TODO: set context from user input ??

Haven't tested if it works with other values
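
If someone wants to experiment, the change is basically to thread the value through the params struct instead of the literal. A sketch only; the field and flag names below are made up, not the real ones:

// utils.h: hypothetical context-size field in the params struct
int32_t n_ctx = 512;

// utils.cpp: hypothetical command-line flag in the argument parser
} else if (arg == "--ctx_size") {
    params.n_ctx = std::stoi(argv[++i]);

// main.cpp: pass it through instead of the hardcoded 512
if (!llama_model_load(params.model, model, vocab, params.n_ctx)) {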

@jaykrell

I didn't see your PR when I read the issue, so I went ahead and made one, very similar.
I made the existing Makefile work with both Unix make and Microsoft nmake.
#36

@0xbitches

0xbitches commented Mar 12, 2023

Using the fix in #31, however, the results from 4 bit models are still repetitive nonsense. FP16 works but the results are also very bad.

Relevant spec: Intel 13700K, 240 ms/token
Built with make.exe from MinGW-w64

@kamyker

kamyker commented Mar 12, 2023

@kamyker

Maybe the context size has to be increased - it's currently hardcoded to 512:

if (!llama_model_load(params.model, model, vocab, 512)) { // TODO: set context from user input ??

Haven't tested if it works with other values

As I said, I made a parameter out of it and that fixes longer prompts, but they are still slow. What I'm saying is that without some kind of quicker prompt loading/caching, this is very far from ChatGPT.

How does, let's say, a 300-token prompt perform for you?

@teknium1

Any chance we could publish binaries for windows?

@jaykrell

@teknium1

Any chance we could publish binaries for windows?

Here https://github.com/jaykrell/llama.cpp/releases/tag/1
but perhaps that is kinda rude of me. I'll delete if there are objections.

@bsiminski
Author

Here is an updated fork based on the initial adjustments done by @etra0:
Visual Studio 2022 - vsproj version

@etra0 I kindly ask you to merge my pull request and push it to @ggerganov's repo.

@etra0
Collaborator

etra0 commented Mar 12, 2023

Here is an updated fork based on the initial adjustments done by @etra0: Visual Studio 2022 - vsproj version

@etra0 I kindly ask you to merge my pull request and push it to @ggerganov's repo.

I don't think I'll merge this, sadly. I don't want to add Solutions to the project; I'd rather go with the nmake approach or finish writing the CMake.

@kbalint

kbalint commented Mar 12, 2023

@jaykrell thank you for your work, I've tried it and it worked! However, the quantizer seemed to run but didn't produce any bin files (tried with 7B and 13B). I could still run the original model on an i5-9600K, about 10 times slower though. :D

ggerganov pushed a commit that referenced this issue Mar 12, 2023
* Apply fixes suggested to build on windows

Issue: #22

* Remove unsupported VLAs

* MSVC: Remove features that are only available on MSVC C++20.

* Fix zero initialization of the other fields.

* Change the use of vector for stack allocations.
@lucasjinreal

@ggerganov will this support be merged to master?

@ShouNichi

Successfully compiled this on MSYS2 (UCRT).

@etra0
Collaborator

etra0 commented Mar 13, 2023

I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75.

If you pull my changes, you can build the project with the following instructions:

# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release

That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory like:

./build/Release/llama.exe -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

The PR is a draft because I need to also update the instructions I guess, but it's pretty much usable right now.

EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically and then just build it.

cc @jinfagang.

@kamyker

kamyker commented Mar 13, 2023

That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory like:

Small feedback: llama.exe should be renamed to main.exe somewhere to be consistent with readme commands.

@bitRAKE
Collaborator

bitRAKE commented Mar 13, 2023

I was able to build with clang (from VS2022 prompt), without any changes:

clang -march=native -O3 -fuse-ld=lld-link -flto main.cpp ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto quantize.cpp ggml.c utils.cpp

Seems to be 10% faster (than timings in #39), ymmv.
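
One small note: both commands above write a.exe by default, so the second build overwrites the first; naming the outputs avoids that:

clang -march=native -O3 -fuse-ld=lld-link -flto -o llama.exe    main.cpp     ggml.c utils.cpp
clang -march=native -O3 -fuse-ld=lld-link -flto -o quantize.exe quantize.cpp ggml.c utils.cpp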

@Zerogoki00

I did the initial draft for CMake support which allows this to be built for Windows as well. You can check the PR at #75.

If you pull my changes, you can build the project with the following instructions:

# Assuming you're using PowerShell
mkdir build
cd build
cmake ..
cmake --build . --config Release

That will build the two executables, quantize.exe and llama.exe; then you can use them from the root llama.cpp directory like:

./build/Release/llama.exe -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

The PR is a draft because I need to also update the instructions I guess, but it's pretty much usable right now.

EDIT: You can also open the llama.cpp folder in Visual Studio 2019 or newer and it should detect the CMake settings automatically and then just build it.

cc @jinfagang.

I installed the VS 2022 Build Tools, MSVC, and CMake.

But I get this error:

C:\Users\quela\Downloads\LLaMA\llama.cpp\build>cmake ..
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version  to target Windows 10.0.22621.
-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error at CMakeLists.txt:2 (project):
  No CMAKE_C_COMPILER could be found.



CMake Error at CMakeLists.txt:2 (project):
  No CMAKE_CXX_COMPILER could be found.



-- Configuring incomplete, errors occurred!
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "C:/Users/quela/Downloads/LLaMA/llama.cpp/build/CMakeFiles/CMakeError.log".

What am I doing wrong?

@etra0
Collaborator

etra0 commented Mar 13, 2023

@Zerogoki00 From the looks of it, it seems that you have no C/C++ compiler. Did you make sure to select the C++ development workload when installing the Build Tools?
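
If the workload is installed and CMake still can't find a compiler, it usually helps to run from the "x64 Native Tools Command Prompt for VS 2022" (so cl.exe is on the PATH), or to name the generator explicitly, for example:

cmake .. -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release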

@kamyker

kamyker commented Mar 14, 2023

Builds fine for me.

Interactive mode doesn't work correctly; the program ends after the first generation.

@1octopus1

help me please

main: seed = 1678814584
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: failed to open './models/7B/ggml-model-q4_0.bin'
main: failed to load model from './models/7B/ggml-model-q4_0.bin'

@1octopus1

help me please

@etra0
Collaborator

etra0 commented Mar 14, 2023

@1octopus1 those warnings are 'normal', as in they don't have anything to do with your errors. Did you do all the rest of the steps (quantize the model and all)? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README.

I know we still need to update the instructions for Windows, but I just haven't found the time yet.
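
Roughly, the remaining README steps look like this (paths are only examples; adjust them to where your model files actually live):

# convert the original PyTorch weights to ggml FP16
python convert-pth-to-ggml.py models/7B/ 1

# quantize the FP16 model down to 4 bits
./build/Release/quantize.exe ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# run it
./build/Release/llama.exe -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128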

@1octopus1

1octopus1 commented Mar 14, 2023

@1octopus1 those warnings are 'normal', as in they don't have anything to do with your errors. Did you do all the rest of the steps (quantize the model and all)? The fixes mentioned here are just to build main (llama.exe) and quantize (quantize.exe); you still need to follow the rest of the README.

I know we still need to update the instructions for Windows, but I just haven't found the time yet.

Yes, I did everything according to the instructions. Okay, I'll wait for the updated instructions. I've spent several hours trying to get it to start =) Just write it out in detail, please, with each step =) Thank you very much.

@RedLeader721

Interactive Mode not working right. It returns to the Bash command prompt after the first message:
$ ./Release/llama.exe -m ../../../Users/ron/llama.cpp/models/7B/ggml-model-q4_0.bin -t 8 --repeat_penalty 1.2 --temp 0.9 --top_p 0.9 -n 256 --color -i -r "User:" -p "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision."

== Running in interactive mode. ==

  • Press Return to return control to LLaMa.
  • If you want to submit another line, end your input in '\'.
    Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
    User:I really want a pizza.
    Assistant [Bob]:OK then, what do you want? [end of text]

main: mem per token = 14434244 bytes
main: load time = 3234.19 ms
main: sample time = 12.65 ms
main: predict time = 12828.59 ms / 183.27 ms per token
main: total time = 31762.91 ms
(venv)
ron@LAPTOP-JIBCUHGM MINGW64 /c/llama/llama.cpp/build (master)
$

@eldash666

eldash666 commented Mar 15, 2023

Hello everyone) How do I install it, and how do I turn it on and off on my PC? Who can explain? I have an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz, 12.0 GB RAM (available: 11.9 GB), Windows 11 Pro. I hope it will work fine.

@bitRAKE
Collaborator

bitRAKE commented Mar 15, 2023

Assuming you are at a VS2022 command prompt and you've installed git/cmake support through the VS Installer:

set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

If you installed another git then that first line might not be needed. Yeah, MS decided not to add git to the path, doh!

Building the repo gives you llama.exe and quantize.exe in the llama.cpp\build\Release directory. You'll need to convert and quantize the model by following the directions for that.

I can't really help beyond that because I have a different build environment; I'm using clang from the terminal.

@eldash666 12GB might be tight.

@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support of the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt (#199): give an example or two, lead the model toward what you want, and it will follow.

@tmzncty

tmzncty commented Mar 16, 2023

Assuming you are at a VS2022 command prompt and you've installed git/cmake support through the VS Installer:

set PATH=%DevEnvDir%CommonExtensions\Microsoft\TeamFoundation\Team Explorer\Git\cmd;%PATH%
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

If you installed another git then that first line might not be needed. Yeah, MS decided not to add git to the path, doh!

Building the repo gives you llama.exe and quantize.exe in the llama.cpp\build\Release directory. You'll need to convert and quantize the model by following the directions for that.

I can't really help beyond that because I have a different build environment; I'm using clang from the terminal.

@eldash666 12GB might be tight.

@RedLeader721, interactive mode has several issues. First, #120 is needed for Windows support of the Ctrl-C handler. Second, it's possible for the reverse prompt to appear as different tokens and be ignored. Also, I'd try a better prompt: give an example or two, lead the model toward what you want, and it will follow.

Thanks~

@tmzncty

tmzncty commented Mar 16, 2023

I'll walk through the whole process in Chinese; English speakers, please translate it yourselves. (I only know a little English)

Just compile following the method described above, and remember to use the installer to get CMake installed first.

Then the build happily starts; just wait for it to finish.

After that you end up with three files; the last two EXEs are the ones you need.

Then we convert the model (link: https://pan.baidu.com/s/1Y7YWdFWX1Yzy2Yuujp8Tqg?pwd=1p5n
extraction code: 1p5n
-- shared from a Baidu Netdisk Super VIP V4 account).

Just write the absolute path of the original model directly (in practice, paths will trip you up a lot):

python convert-pth-to-ggml.py B:LLaMA/7B 1

Then wait. Once that finishes, use the quantize.exe you built earlier for the next conversion:

quantize.exe ggml-model-f16.bin ggml-model-q4.bin 2

Wait for it to finish.

Then, depending on your mood, either add llama.exe to your PATH or just drag it over and run it; the parameters are all given, just follow them. (Remember to change GBK to UTF-8; damned encoding issues.)

llama.exe -m ggml-model-q4.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i

Have fun.

lapo-luchini added a commit to lapo-luchini/bloomz.cpp that referenced this issue Mar 16, 2023
Mostly taken from ggerganov/llama.cpp#22

Some might be unnecessary, this is the first version I managed to run.
@mulyadi

mulyadi commented Mar 28, 2023

Does anyone have the binary quantize.exe? Mine doesn't process the FP16 files. There is no error message, but there is no output file. Please publish the file if you have a working one. Thank you.

@Beeplex64

I wrote a Windows version of "quantize.sh".
If you want to use it, just copy the code below, paste it into Notepad, and save it as "quantize.bat".

@echo off
setlocal enabledelayedexpansion
cd /d %~dp0

set PARAM_CHECK=FALSE
set MODEL_TYPE=%1
set PARAM="%2"

rem Is there a way to use findstr?
IF %MODEL_TYPE%==7B set PARAM_CHECK=TRUE
IF %MODEL_TYPE%==13B set PARAM_CHECK=TRUE
IF %MODEL_TYPE%==30B set PARAM_CHECK=TRUE
IF %MODEL_TYPE%==65B set PARAM_CHECK=TRUE

if %PARAM_CHECK%==FALSE (
    echo;
    echo "Usage: quantize.sh 7B|13B|30B|65B [--remove-f16]"
    echo;
    exit 1
)

for %%i in (models/%MODEL_TYPE%/ggml-model-f16.bin*) do (
    rem pass only the file name so the path is not prepended twice in :Quantize
    call :Quantize %%~nxi
)
exit 0

:Quantize
    set INPUT_MODEL=%1
    set OUTPUT_MODEL=!INPUT_MODEL:f16=q4_0!
    call quantize.exe models\%MODEL_TYPE%\%INPUT_MODEL% models\%MODEL_TYPE%\%OUTPUT_MODEL% 2
    if %PARAM%=="--remove-f16" (
        call del models\%MODEL_TYPE%\%INPUT_MODEL%
    )
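
A quick usage sketch, assuming quantize.exe and the models folder sit next to the batch file:

rem quantize every models\7B\ggml-model-f16.bin* file
quantize.bat 7B
rem same for 13B, deleting the FP16 files afterwards
quantize.bat 13B --remove-f16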

@mulyadi

mulyadi commented Mar 28, 2023

@Beeplex64 the output file is still not created after using your BAT file. Can you please publish your quantize.exe? Not sure why the one that I compiled doesn't work. Thank you.

CC @tmzncty I saw your screenshot; if you could publish the quantize.exe binary, I would appreciate it. Thank you.

@huangl22

huangl22 commented Apr 1, 2023

Does anyone have the quantize.exe and llama.exe binaries? I only have llama.lib after the cmake build operation. How can I deal with it?

@danskycode

danskycode commented Apr 1, 2023

Does anyone have the quantize.exe and llama.exe binaries? I only have llama.lib after the cmake build operation. How can I deal with it?

@huangl22 Check the directory llama.cpp\build\bin\Release - assuming you saw the llama.lib in llama.cpp\build\Release

@huangl22

huangl22 commented Apr 1, 2023

Does anyone have the quantize.exe and llama.exe binaries? I only have llama.lib after the cmake build operation. How can I deal with it?

@huangl22 Check the directory llama.cpp\build\bin\Release - assuming you saw the llama.lib in llama.cpp\build\Release

there is quantize.exe in llama.cpp\build\bin\Release, but there isn't llama.exe in it.

prusnak added the windows label Apr 1, 2023
@kevingosse

kevingosse commented Apr 1, 2023

@huangl22 main.exe is the old llama.exe #22 (comment)

@sw
Collaborator

sw commented Apr 16, 2023

Closing this as there doesn't seem to be a concrete issue on Windows anymore, and we have CI checks now. If you still have problems, please open a new issue.

sw closed this as not planned Apr 16, 2023