Any LET providers plan to focus on the AI domain - hosting open-source LLMs?

Do any LET providers plan to focus on the AI domain, which needs GPU servers to run open-source models? The industry is shifting heavily towards open-source LLMs, and I'm wondering if anyone here plans to focus on that segment.

Comments

  • sh97 Member
    edited December 2023

    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

  • @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

  • M66B Veteran
    edited December 2023

    Maybe you need this?

    https://deepinfra.com/

  • I think c1v hosting labels some of their services as AI; just don't be surprised if the site takes a few minutes to load. Otherwise, you're probably looking at the big providers like Google, Paperspace, AWS and the like.

  • sreekanth850 Member
    edited December 2023

    @M66B said:
    Maybe you need this?

    https://deepinfra.com/

    Wow, this is new to me. It seems super cheap and is exactly the kind of thing I was looking for. Thanks for sharing.

  • bh4tech Member
    edited December 2023

    @sreekanth850 Are you planning to use any? If yes, I can provide you with an open-source LLM (Microsoft Phi) along with the necessary hardware (AMD GPU with 16GB VRAM, not NVIDIA) and software (llama.cpp) to run it at INR 9000/month (payment via UPI only). Edit: price reduced to INR 6600/month. You can check the generation speed and other details from the output of the two prompts below:

    Question: You are a bank clerk. Write a paragraph about your daily life.
    Answer: As a bank clerk, my daily life revolves around providing excellent customer service and managing transactions efficiently. I start my day by arriving at the bank branch early in the morning to ensure everything is in order before opening for the day. Throughout the day, I assist customers with various banking needs such as account inquiries, deposit withdrawals, and loan approvals. It is crucial to maintain accuracy while handling sensitive financial information and ensuring customer satisfaction. At the end of each day, I reconcile accounts, file reports, and communicate with my team to discuss any outstanding issues or improvements for the upcoming days.
    [end of text]

    llama_print_timings: load time = 155.09 ms
    llama_print_timings: sample time = 22.41 ms / 121 runs ( 0.19 ms per token, 5399.13 tokens per second)
    llama_print_timings: prompt eval time = 257.93 ms / 16 tokens ( 16.12 ms per token, 62.03 tokens per second)
    llama_print_timings: eval time = 8449.64 ms / 120 runs ( 70.41 ms per token, 14.20 tokens per second)
    llama_print_timings: total time = 8778.03 ms
    Log end

    root@WNX0010281:~/llama# ./main -m phi15/ggml-model-f16.gguf -p "Question: Why is transformers better than RNN for implementing a large language model? "
    Log start
    main: build = 1695 (925e558)
    main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
    main: seed = 1703511638
    ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
    ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
    ggml_init_cublas: found 1 ROCm devices:
    Device 0: AMD Radeon Graphics, compute capability 10.3
    llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from phi15/ggml-model-f16.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = phi2
    llama_model_loader: - kv 1: general.name str = Phi2
    llama_model_loader: - kv 2: phi2.context_length u32 = 2048
    llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
    llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
    llama_model_loader: - kv 5: phi2.block_count u32 = 24
    llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
    llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
    llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
    llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
    llama_model_loader: - kv 10: general.file_type u32 = 1
    llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
    llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
    llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
    llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
    llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
    llama_model_loader: - type f32: 147 tensors
    llama_model_loader: - type f16: 98 tensors
    llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = phi2
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 51200
    llm_load_print_meta: n_merges = 50000
    llm_load_print_meta: n_ctx_train = 2048
    llm_load_print_meta: n_embd = 2048
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 32
    llm_load_print_meta: n_layer = 24
    llm_load_print_meta: n_rot = 32
    llm_load_print_meta: n_gqa = 1
    llm_load_print_meta: f_norm_eps = 1.0e-05
    llm_load_print_meta: f_norm_rms_eps = 0.0e+00
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff = 8192
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx = 2048
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: model type = ?B
    llm_load_print_meta: model ftype = F16
    llm_load_print_meta: model params = 1.42 B
    llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
    llm_load_print_meta: general.name = Phi2
    llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
    llm_load_print_meta: LF token = 128 'Ä'
    llm_load_tensors: ggml ctx size = 0.09 MiB
    llm_load_tensors: using ROCm for GPU acceleration
    llm_load_tensors: system memory used = 2706.37 MiB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/25 layers to GPU
    ................................................................................
    llama_new_context_with_model: n_ctx = 512
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
    llama_build_graph: non-view tensors processed: 582/582
    llama_new_context_with_model: compute buffer total size = 111.19 MiB

    system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
    sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampling order:
    CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
    generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

    Question: Why is transformers better than RNN for implementing a large language model?
    Answer: Transformers are more flexible and efficient in handling variable input lengths, which makes them better suited to handle long texts. They can also be trained on diverse text data sets without needing specialized preprocessing steps like sequence padding or tokenization. Moreover, they leverage the power of self-attention mechanisms that RNNs lack.
    [end of text]

    llama_print_timings: load time = 155.69 ms
    llama_print_timings: sample time = 12.06 ms / 68 runs ( 0.18 ms per token, 5637.07 tokens per second)
    llama_print_timings: prompt eval time = 300.38 ms / 18 tokens ( 16.69 ms per token, 59.93 tokens per second)
    llama_print_timings: eval time = 4727.56 ms / 67 runs ( 70.56 ms per token, 14.17 tokens per second)
    llama_print_timings: total time = 5066.72 ms
    Log end

    PS: I'm not a provider; I use this dedicated server for some other tasks but discovered that it also works well for LLM inference. Since the whole 16GB of VRAM is not used, 3-4 LLMs can be run simultaneously, as the server showed just 30-40% GPU utilisation while inferencing.
    The offer is available not only to him but to anyone who can pay via UPI.
    For queries or counter-offers, please DM.
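
    For anyone curious how this setup is usually put together, here is a minimal sketch of building llama.cpp with its ROCm (hipBLAS) backend and offloading the Phi layers to the GPU. The make flag and the -ngl option reflect late-2023 llama.cpp builds, so treat the exact names as assumptions and check the repo README.

    # Sketch: build llama.cpp with the ROCm/hipBLAS backend (flag name as of late-2023 builds)
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
    make LLAMA_HIPBLAS=1
    # Run the f16 Phi GGUF and offload all layers to the GPU with -ngl
    # (the log above shows 0/25 layers offloaded, so generation ran mostly on the CPU)
    ./main -m phi15/ggml-model-f16.gguf -ngl 25 \
        -p "Question: You are a bank clerk. Write a paragraph about your daily life."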

  • @bh4tech said:
    @sreekanth850 Are you planning to use any? If yes, I can provide you with an open-source LLM (Microsoft Phi) along with the necessary hardware (AMD GPU with 16GB VRAM, not NVIDIA) and software (llama.cpp) to run it at INR 9000/month (payment via UPI only).

    As we are prototyping something, we cannot afford a fully dedicated instance. We're planning to start with DeepInfra and OpenAI initially.
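
    For reference, a pay-as-you-go call like that is roughly a single HTTP request. This is only a sketch assuming DeepInfra's OpenAI-compatible chat endpoint and a Mistral model ID; double-check the exact path and model name against their docs.

    # Sketch of a pay-as-you-go request to DeepInfra's OpenAI-compatible endpoint
    # (endpoint path and model ID are assumptions; verify against deepinfra.com docs)
    curl https://api.deepinfra.com/v1/openai/chat/completions \
        -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1",
             "messages": [{"role": "user", "content": "Hello"}]}'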

  • bh4tech Member
    edited December 2023

    @sreekanth850 Looking at DeepInfra's pricing, I realised I had made an overpriced quote, so I'm reducing it to INR 6600/month. You can easily run 3 Phi instances in parallel on the machine.

  • I was running some other resource-intensive task in the background (which I had totally forgotten about) while generating the earlier outputs, and so was getting half the performance (~60 tokens/second compared to the optimal ~120 tokens/second). I just realised it and have stopped the background task. Here is a fresh report with no background tasks running:

    Question: Why is transformers better than RNN for implementing a large language model?
    Answer: Transformers are more efficient and require less computational resources to train on large datasets. They can handle long sentences without encountering memory limitations, making them ideal for handling text data in natural language processing tasks like sentiment analysis or machine translation. In contrast, RNNs struggle with large sequences due to the need for recurrent neural networks' complex architectures, which consume significant amounts of memory and time to train effectively on such datasets.

    [end of text]

    llama_print_timings: load time = 102.19 ms
    llama_print_timings: sample time = 13.90 ms / 86 runs ( 0.16 ms per token, 6185.27 tokens per second)
    llama_print_timings: prompt eval time = 145.17 ms / 18 tokens ( 8.06 ms per token, 123.99 tokens per second)
    llama_print_timings: eval time = 4323.84 ms / 85 runs ( 50.87 ms per token, 19.66 tokens per second)
    llama_print_timings: total time = 4508.76 ms
    Log end

    root@WNX0010281:~/llama# ./main -m phi15/ggml-model-f16.gguf -p "Question: Why is RNN better than CNN for training a language translation model? "
    Log start
    main: build = 1695 (925e558)
    main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
    main: seed = 1703514287
    ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
    ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
    ggml_init_cublas: found 1 ROCm devices:
    Device 0: AMD Radeon Graphics, compute capability 10.3
    llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from phi15/ggml-model-f16.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = phi2
    llama_model_loader: - kv 1: general.name str = Phi2
    llama_model_loader: - kv 2: phi2.context_length u32 = 2048
    llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
    llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
    llama_model_loader: - kv 5: phi2.block_count u32 = 24
    llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
    llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
    llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
    llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
    llama_model_loader: - kv 10: general.file_type u32 = 1
    llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
    llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
    llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
    llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
    llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
    llama_model_loader: - type f32: 147 tensors
    llama_model_loader: - type f16: 98 tensors
    llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = phi2
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 51200
    llm_load_print_meta: n_merges = 50000
    llm_load_print_meta: n_ctx_train = 2048
    llm_load_print_meta: n_embd = 2048
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 32
    llm_load_print_meta: n_layer = 24
    llm_load_print_meta: n_rot = 32
    llm_load_print_meta: n_gqa = 1
    llm_load_print_meta: f_norm_eps = 1.0e-05
    llm_load_print_meta: f_norm_rms_eps = 0.0e+00
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff = 8192
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx = 2048
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: model type = ?B
    llm_load_print_meta: model ftype = F16
    llm_load_print_meta: model params = 1.42 B
    llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
    llm_load_print_meta: general.name = Phi2
    llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
    llm_load_print_meta: LF token = 128 'Ä'
    llm_load_tensors: ggml ctx size = 0.09 MiB
    llm_load_tensors: using ROCm for GPU acceleration
    llm_load_tensors: system memory used = 2706.37 MiB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/25 layers to GPU
    ................................................................................
    llama_new_context_with_model: n_ctx = 512
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
    llama_build_graph: non-view tensors processed: 582/582
    llama_new_context_with_model: compute buffer total size = 111.19 MiB

    system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
    sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampling order:
    CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
    generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

    Question: Why is RNN better than CNN for training a language translation model?
    Answer: RNN can handle sequential input and output, making it suitable for learning from text data. It also has the capacity to retain information over long sequences of characters or words.

    [end of text]

    llama_print_timings: load time = 100.15 ms
    llama_print_timings: sample time = 5.85 ms / 39 runs ( 0.15 ms per token, 6668.95 tokens per second)
    llama_print_timings: prompt eval time = 140.10 ms / 17 tokens ( 8.24 ms per token, 121.34 tokens per second)
    llama_print_timings: eval time = 1926.63 ms / 38 runs ( 50.70 ms per token, 19.72 tokens per second)
    llama_print_timings: total time = 2084.10 ms
    Log end

    However, the pricing doesn't change; it's still INR 6600/month.

  • vsys_host Member, Patron Provider

    We've got GPU dedicated server plans! Take a look o:)

    Also, here you can find GPU dedicated servers on sale:
    https://vsys.host/gpu-servers-sale

    There is an option to inform us about the GPU dedicated server setup or solution you need, and we'll customize it specifically for you. Crafting personalized solutions is our specialty and what we find most fulfilling! <3

    • Capable of mining, video encoding/decoding, data science tasks, and more
    • Tailored configurations for GPU servers, single or dual GPUs
    • Reliable and robust hardware for high-performance computing
    • Accepting Crypto payments for GPU server hosting
    • Enjoy 1 Gbps unlimited bandwidth
    • Rapid deployment in less than 48 hours

    PRO GPU
    CPU: E5-2670V3 (12×2.3GHZ)
    GPU: GeForce GTX 1080 Ti
    RAM: 64GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $299/ month
    DISCOUNTED PRICE: $200 / month
    CONFIGURE AND BUY SERVER

    INCREDIBLE GPU
    CPU: E5-2670V3 (12×2.3GHZ)
    GPU: 2 x GeForce GTX 1080 Ti
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $399 / month
    DISCOUNTED PRICE: $300 / month
    CONFIGURE AND BUY SERVER

    SUPERIOR GPU
    CPU: Dual E5-2670V3 (24×2.3GHZ)
    GPU: 2 x GeForce RTX 3080
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $499 / month
    DISCOUNTED PRICE: $350 / month
    CONFIGURE AND BUY SERVER

    INCREDIBLE GPU +
    CPU: Dual E5-2670V3 (24×2.3GHZ)
    GPU: 2 x GeForce RTX 3080 Ti
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $599 / month
    DISCOUNTED PRICE: $400 / month
    CONFIGURE AND BUY SERVER

  • bh4tech Member
    edited December 2023

    @sreekanth850 is not ready to take my server with 16GB VRAM at ~$80/month, so I don't think he will take yours with 11GB VRAM at $200/month. I understand that your GPU will be faster, but even then you need enough VRAM to keep the model in memory while running (especially when demand grows and you need to run 2-3 LLMs in parallel).

    I guess he is starting small and will stick with API providers.

  • @bh4tech said:
    @sreekanth850 is not ready to take my server with 16GB VRAM at ~$80/month, so I don't think he will take yours with 11GB VRAM at $200/month. I understand that your GPU will be faster, but even then you need enough VRAM to keep the model in memory while running (especially when demand grows and you need to run 2-3 LLMs in parallel).

    I guess he is starting small and will stick with API providers.

    Yes, that too. But I will keep this in mind; pay-as-you-go will be viable for prototyping.

  • crunchbits Member, Patron Provider, Top Host

    @sreekanth850 said:

    @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

    Thanks @sh97

    DeepInfra is well priced for an end-user turnkey solution. I like their approach. We're building some ISOs now for "one-click deployment" of popular open-source LLMs, as I didn't realize how 'tricky' some of them are to get going, especially if you're relatively new to them. We don't have as much time as a team to mess around with them 'outside of work' as we used to, so it's always useful feedback.
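
    As an illustration of what "one-click" can boil down to, here is a minimal sketch of a first-boot script using the open-source Ollama runtime and a quantized Mistral model; this is just one possible approach, not necessarily what we are building.

    # Sketch of a first-boot provisioning script for a "one-click LLM" image (Ollama is one open-source option)
    curl -fsSL https://ollama.com/install.sh | sh   # installs the runtime and starts it as a service
    ollama pull mistral                             # pre-fetch a quantized Mistral model (roughly 4 GB)
    # The local HTTP API then listens on localhost:11434 for chat/completion requests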

    What LLMs would you/anyone be interested in being able to quickly deploy?

  • About DeepInfra, feel free to take advantage of the F6S deal for a 150h free test
    https://www.f6s.com/company-deals/deepinfra/150h-free-ai-ml-models-by-api-14180

  • @crunchbits said:

    @sreekanth850 said:

    @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

    Thanks @sh97

    DeepInfra is well priced for an end-user turnkey solution. I like their approach. We're building some ISOs now for "one-click deployment" of popular open-source LLMs, as I didn't realize how 'tricky' some of them are to get going, especially if you're relatively new to them. We don't have as much time as a team to mess around with them 'outside of work' as we used to, so it's always useful feedback.

    What LLMs would you/anyone be interested in being able to quickly deploy?

    Mistral, Llama, and Falcon are good to go with, imo.

  • PUSHR_Victor Member, Host Rep

    @sreekanth850 said:
    Mistral, Llama, and Falcon are good to go with, imo.

    Falcon will probably end up being expensive: https://huggingface.co/spaces/tiiuae/falcon-180b-license/blob/4b5cfaa8bc5c8af982fb545c1f832e3541683aef/LICENSE.txt#L62

    Not sure about the other two.

  • @PUSHR_Victor said:

    @sreekanth850 said:
    Mistral, Llama, and Falcon are good to go with, imo.

    Falcon will probably end up being expensive: https://huggingface.co/spaces/tiiuae/falcon-180b-license/blob/4b5cfaa8bc5c8af982fb545c1f832e3541683aef/LICENSE.txt#L62

    Not sure about the other 2

    I thought it was under the Apache 2.0 license.

  • Mistral

  • sreekanth850 Member
    edited January 18

    Finally, I settled on DeepInfra and Together AI, both of which provide more than 50 open-source models. DeepInfra has served my purpose well, and Together AI is kept as a backup.
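
    For anyone setting up a similar primary/backup split: both providers speak an OpenAI-style chat API, so switching to the backup is mostly a matter of changing the base URL and key. A sketch of the Together AI side is below; the endpoint path and model ID are assumptions to verify in their docs.

    # Sketch: the Together AI backup call, same OpenAI-style payload as the DeepInfra one above
    # (endpoint path and model ID are assumptions; verify against Together's docs)
    curl https://api.together.xyz/v1/chat/completions \
        -H "Authorization: Bearer $TOGETHER_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1",
             "messages": [{"role": "user", "content": "Hello"}]}'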
