Aquileo | KyleHessling1/Qwopus-GLM-18B-Merged-GGUF · Best tool-use model for 16GB VRAM GPU

Best tool-use model for 16GB VRAM GPU

#2
by a123451 - opened

I tried a bunch of the latest models with various builds (tbq plus, buun, ik_llama, etc.), and none of them fixed the issues I have with GBrain. Models fine-tuned for Hermes, like Carnice v2 27B, can do it but are painfully slow. Probably just me being a newb.
Anyway, this one does the job: 42 t/s at long context, dropping to 40 at the end. GBrain came back 90/100 with full enrichment of 3132 bookmarks, the backlink fix, 100% embedding coverage, and graph stats (wikilinks created, timeline entries, graph_coverage %) all done.
I'm using Tom's turboquant plus build with these parameters:
--model "D:\llama-models\Qwopus-GLM-18B-Healed-Q6_K.gguf" ^
-fa 1 ^
-ctk q4_0 -ctv q4_0 ^
-c 131064 ^
-ngl 99 ^
--batch-size 2048 ^
--ubatch-size 512 ^
-np 1 ^
--temp 0.7 ^
--top-k 40 ^
--top-p 0.9 ^
--repeat-penalty 1.1 ^
--min-p 0.05 ^
--jinja
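
For anyone wondering why -ctk q4_0 -ctv q4_0 matters at -c 131064 on a 16GB card, here is a rough back-of-the-envelope sketch in Python. It assumes every layer uses ordinary attention with a full KV cache, which may not hold for this merge, and the layer/head numbers are hypothetical placeholders rather than values from the model card. The bytes-per-block figures are the ggml storage sizes for f16, q8_0 and q4_0.

```python
# Bytes needed to store 32 cache values in each ggml cache type:
#   f16  : 32 values * 2 bytes
#   q8_0 : 2-byte scale + 32 one-byte values
#   q4_0 : 2-byte scale + 16 bytes of packed 4-bit values
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size in GiB for a plain full-attention transformer."""
    values_per_tensor = n_ctx * n_layers * n_kv_heads * head_dim
    k_bytes = values_per_tensor * BYTES_PER_32_VALUES[type_k] / 32
    v_bytes = values_per_tensor * BYTES_PER_32_VALUES[type_v] / 32
    return (k_bytes + v_bytes) / 2**30


# HYPOTHETICAL architecture numbers for illustration only;
# read the real ones from the GGUF metadata of the model you load.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for type_k, type_v in [("f16", "f16"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    size = kv_cache_gib(131064, type_k=type_k, type_v=type_v, **shape)
    print(f"-ctk {type_k} -ctv {type_v}: about {size:.2f} GiB of KV cache")
```

Whatever the real shape is, the ratio holds under these assumptions: a q4_0 cache takes 18/64 of the space of an f16 one, which is what leaves room for the long context next to the weights.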
Any plan to make another one for qwopus36-35b-a3b please?

Hey there!

Thank you for the feedback. I'm pumped to hear the model is doing so well! I am currently working on another merge; we'll see how it goes! A similar duplication merge with Qwopus 3.6 35B A3B is possible, but we need to make another fine-tune of it first. That is in the works with my friend Jackrong, so in the future we could even do a merge on that!

Thank you for the support and flags, this is incredibly useful and motivating!
-Kyle

Testing Q4_K_M on Ubuntu 24.04 LTS, I got an extra 20 t/s and haven't seen any problems with long context yet. However, there is an issue where the model offloads all layers to the GPU but the compute runs only on the CPU. I don't know whether this is the llama.cpp build (mine is built from turboquant plus) or the model. To reproduce, set -ctk q8_0 -ctv q4_0 and check the CPU load: mine shows >900% while the GPU sits at 0~1%. There is no issue when using turboquant with -ctk q8_0 -ctv turbo4.
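
In case it helps others confirm the same symptom, here is a minimal Python sketch of the check described above. It assumes an NVIDIA GPU with nvidia-smi on the PATH (the post does not say which GPU is used), and the thresholds in looks_cpu_bound are arbitrary illustrative values, not anything defined by llama.cpp. The CPU percentage is whatever top or htop reports for the server process.

```python
import subprocess


def parse_gpu_utilization(nvidia_smi_csv: str) -> list[int]:
    """Parse one utilization percentage per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_csv.splitlines() if line.strip()]


def read_gpu_utilization() -> list[int]:
    """Query current GPU utilization (percent) for every visible NVIDIA GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


def looks_cpu_bound(cpu_percent: float, gpu_percents: list[int],
                    cpu_busy: float = 200.0, gpu_idle: int = 5) -> bool:
    """True when the process is burning CPU while every GPU is nearly idle.

    cpu_busy and gpu_idle are illustrative thresholds (assumptions).
    """
    return (
        bool(gpu_percents)
        and cpu_percent >= cpu_busy
        and all(g <= gpu_idle for g in gpu_percents)
    )
```

During generation, calling looks_cpu_bound(900, read_gpu_utilization()) with the CPU figure from top would flag the case reported here (CPU above 900%, GPU at 0~1%).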
