Any LET providers plan to focus on the AI domain - hosting open-source LLMs?

Do any LET providers plan to focus on the AI domain, which needs GPU servers to run open-source models? The industry is shifting heavily towards open-source LLMs, and I'm wondering if anyone here plans to focus on that segment.

Comments

  • sh97 Member
    edited December 2023

    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

  • @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

  • M66B Veteran
    edited December 2023

    Maybe you need this?

    https://deepinfra.com/

  • I think c1v hosting labels some of their services as AI; just don't be surprised if the site takes a few minutes to load. Otherwise, you're probably looking at the big providers like Google, Paperspace, AWS and the like.

  • sreekanth850 Member
    edited December 2023

    @M66B said:
    Maybe you need this?

    https://deepinfra.com/

    Wow, this is new to me. It seems super cheap and is exactly the kind of thing I was looking for. Thanks for sharing.

  • bh4tech Member
    edited December 2023

    @sreekanth850 Are you planning to use any? If yes, I can provide you with an open-source LLM (Microsoft Phi) along with the necessary hardware (AMD GPU with 16GB VRAM, not NVIDIA) and software (llama.cpp) to run it at INR 9000/month (payment via UPI only). Edit: price reduced to INR 6600/month. You can check the generation speed and other details from the output of the two prompts below:

    Question: You are a bank clerk. Write a paragraph about your daily life.
    Answer: As a bank clerk, my daily life revolves around providing excellent customer service and managing transactions efficiently. I start my day by arriving at the bank branch early in the morning to ensure everything is in order before opening for the day. Throughout the day, I assist customers with various banking needs such as account inquiries, deposit withdrawals, and loan approvals. It is crucial to maintain accuracy while handling sensitive financial information and ensuring customer satisfaction. At the end of each day, I reconcile accounts, file reports, and communicate with my team to discuss any outstanding issues or improvements for the upcoming days.
    [end of text]

    llama_print_timings: load time = 155.09 ms
    llama_print_timings: sample time = 22.41 ms / 121 runs ( 0.19 ms per token, 5399.13 tokens per second)
    llama_print_timings: prompt eval time = 257.93 ms / 16 tokens ( 16.12 ms per token, 62.03 tokens per second)
    llama_print_timings: eval time = 8449.64 ms / 120 runs ( 70.41 ms per token, 14.20 tokens per second)
    llama_print_timings: total time = 8778.03 ms
    Log end

    root@WNX0010281:~/llama# ./main -m phi15/ggml-model-f16.gguf -p "Question: Why is transformers better than RNN for implementing a large language model? "
    Log start
    main: build = 1695 (925e558)
    main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
    main: seed = 1703511638
    ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
    ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
    ggml_init_cublas: found 1 ROCm devices:
    Device 0: AMD Radeon Graphics, compute capability 10.3
    llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from phi15/ggml-model-f16.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = phi2
    llama_model_loader: - kv 1: general.name str = Phi2
    llama_model_loader: - kv 2: phi2.context_length u32 = 2048
    llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
    llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
    llama_model_loader: - kv 5: phi2.block_count u32 = 24
    llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
    llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
    llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
    llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
    llama_model_loader: - kv 10: general.file_type u32 = 1
    llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
    llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
    llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
    llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
    llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
    llama_model_loader: - type f32: 147 tensors
    llama_model_loader: - type f16: 98 tensors
    llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = phi2
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 51200
    llm_load_print_meta: n_merges = 50000
    llm_load_print_meta: n_ctx_train = 2048
    llm_load_print_meta: n_embd = 2048
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 32
    llm_load_print_meta: n_layer = 24
    llm_load_print_meta: n_rot = 32
    llm_load_print_meta: n_gqa = 1
    llm_load_print_meta: f_norm_eps = 1.0e-05
    llm_load_print_meta: f_norm_rms_eps = 0.0e+00
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff = 8192
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx = 2048
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: model type = ?B
    llm_load_print_meta: model ftype = F16
    llm_load_print_meta: model params = 1.42 B
    llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
    llm_load_print_meta: general.name = Phi2
    llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
    llm_load_print_meta: LF token = 128 'Ä'
    llm_load_tensors: ggml ctx size = 0.09 MiB
    llm_load_tensors: using ROCm for GPU acceleration
    llm_load_tensors: system memory used = 2706.37 MiB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/25 layers to GPU
    ................................................................................
    llama_new_context_with_model: n_ctx = 512
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
    llama_build_graph: non-view tensors processed: 582/582
    llama_new_context_with_model: compute buffer total size = 111.19 MiB

    system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
    sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampling order:
    CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
    generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

    Question: Why is transformers better than RNN for implementing a large language model?
    Answer: Transformers are more flexible and efficient in handling variable input lengths, which makes them better suited to handle long texts. They can also be trained on diverse text data sets without needing specialized preprocessing steps like sequence padding or tokenization. Moreover, they leverage the power of self-attention mechanisms that RNNs lack.
    [end of text]

    llama_print_timings: load time = 155.69 ms
    llama_print_timings: sample time = 12.06 ms / 68 runs ( 0.18 ms per token, 5637.07 tokens per second)
    llama_print_timings: prompt eval time = 300.38 ms / 18 tokens ( 16.69 ms per token, 59.93 tokens per second)
    llama_print_timings: eval time = 4727.56 ms / 67 runs ( 70.56 ms per token, 14.17 tokens per second)
    llama_print_timings: total time = 5066.72 ms
    Log end

    PS: I'm not a provider; I use this dedicated server for some other tasks but discovered that it also works well for LLM inference. Since the whole 16GB of VRAM is not used, 3-4 LLMs can be run simultaneously, as the server showed just 30-40% GPU utilisation while inferencing.
    The offer is available not only to him but to anyone who can pay via UPI.
    For queries or counter-offers, please DM.
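
    For anyone curious how this setup is usually put together, here is a minimal sketch of building llama.cpp with its ROCm (hipBLAS) backend and offloading the Phi layers to the GPU. The make flag and the -ngl option reflect late-2023 llama.cpp builds, so treat the exact names as assumptions and check the repo README.

    # Sketch: build llama.cpp with the ROCm/hipBLAS backend (flag name as of late-2023 builds)
    git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
    make LLAMA_HIPBLAS=1
    # Run the f16 Phi GGUF and offload all layers to the GPU with -ngl
    # (the log above shows 0/25 layers offloaded, so generation ran mostly on the CPU)
    ./main -m phi15/ggml-model-f16.gguf -ngl 25 \
        -p "Question: You are a bank clerk. Write a paragraph about your daily life."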

  • @bh4tech said:
    @sreekanth850 Are you planning to use any? If yes, I can provide you with an open-source LLM (Microsoft Phi) along with the necessary hardware (AMD GPU with 16GB VRAM, not NVIDIA) and software (llama.cpp) to run it at INR 9000/month (payment via UPI only).

    As we are prototyping something, we cannot afford a fully dedicated instance. We're planning to start with DeepInfra and OpenAI initially.
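
    For reference, a pay-as-you-go call like that is roughly a single HTTP request. This is only a sketch assuming DeepInfra's OpenAI-compatible chat endpoint and a Mistral model ID; double-check the exact path and model name against their docs.

    # Sketch of a pay-as-you-go request to DeepInfra's OpenAI-compatible endpoint
    # (endpoint path and model ID are assumptions; verify against deepinfra.com docs)
    curl https://api.deepinfra.com/v1/openai/chat/completions \
        -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1",
             "messages": [{"role": "user", "content": "Hello"}]}'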

  • bh4tech Member
    edited December 2023

    @sreekanth850 Looking at DeepInfra's pricing, I realised I had made an overpriced quote, so I'm reducing it to INR 6600/month. You can easily run 3 Phi instances in parallel on the machine.

  • I was running some other resource-intensive task in the background (which I had totally forgotten about) while generating the earlier outputs, and so was getting half the performance (~60 tokens/second compared to the optimal ~120 tokens/second). I just realised it and have stopped the background task. Here is a fresh report with no background tasks running:

    Question: Why is transformers better than RNN for implementing a large language model?
    Answer: Transformers are more efficient and require less computational resources to train on large datasets. They can handle long sentences without encountering memory limitations, making them ideal for handling text data in natural language processing tasks like sentiment analysis or machine translation. In contrast, RNNs struggle with large sequences due to the need for recurrent neural networks' complex architectures, which consume significant amounts of memory and time to train effectively on such datasets.

    [end of text]

    llama_print_timings: load time = 102.19 ms
    llama_print_timings: sample time = 13.90 ms / 86 runs ( 0.16 ms per token, 6185.27 tokens per second)
    llama_print_timings: prompt eval time = 145.17 ms / 18 tokens ( 8.06 ms per token, 123.99 tokens per second)
    llama_print_timings: eval time = 4323.84 ms / 85 runs ( 50.87 ms per token, 19.66 tokens per second)
    llama_print_timings: total time = 4508.76 ms
    Log end

    root@WNX0010281:~/llama# ./main -m phi15/ggml-model-f16.gguf -p "Question: Why is RNN better than CNN for training a language translation model? "
    Log start
    main: build = 1695 (925e558)
    main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
    main: seed = 1703514287
    ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
    ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
    ggml_init_cublas: found 1 ROCm devices:
    Device 0: AMD Radeon Graphics, compute capability 10.3
    llama_model_loader: loaded meta data with 19 key-value pairs and 245 tensors from phi15/ggml-model-f16.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = phi2
    llama_model_loader: - kv 1: general.name str = Phi2
    llama_model_loader: - kv 2: phi2.context_length u32 = 2048
    llama_model_loader: - kv 3: phi2.embedding_length u32 = 2048
    llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 8192
    llama_model_loader: - kv 5: phi2.block_count u32 = 24
    llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
    llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
    llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
    llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
    llama_model_loader: - kv 10: general.file_type u32 = 1
    llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
    llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
    llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
    llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
    llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
    llama_model_loader: - type f32: 147 tensors
    llama_model_loader: - type f16: 98 tensors
    llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = phi2
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 51200
    llm_load_print_meta: n_merges = 50000
    llm_load_print_meta: n_ctx_train = 2048
    llm_load_print_meta: n_embd = 2048
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 32
    llm_load_print_meta: n_layer = 24
    llm_load_print_meta: n_rot = 32
    llm_load_print_meta: n_gqa = 1
    llm_load_print_meta: f_norm_eps = 1.0e-05
    llm_load_print_meta: f_norm_rms_eps = 0.0e+00
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff = 8192
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx = 2048
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: model type = ?B
    llm_load_print_meta: model ftype = F16
    llm_load_print_meta: model params = 1.42 B
    llm_load_print_meta: model size = 2.64 GiB (16.01 BPW)
    llm_load_print_meta: general.name = Phi2
    llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
    llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
    llm_load_print_meta: LF token = 128 'Ä'
    llm_load_tensors: ggml ctx size = 0.09 MiB
    llm_load_tensors: using ROCm for GPU acceleration
    llm_load_tensors: system memory used = 2706.37 MiB
    llm_load_tensors: offloading 0 repeating layers to GPU
    llm_load_tensors: offloaded 0/25 layers to GPU
    ................................................................................
    llama_new_context_with_model: n_ctx = 512
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
    llama_build_graph: non-view tensors processed: 582/582
    llama_new_context_with_model: compute buffer total size = 111.19 MiB

    system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
    sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampling order:
    CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
    generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

    Question: Why is RNN better than CNN for training a language translation model?
    Answer: RNN can handle sequential input and output, making it suitable for learning from text data. It also has the capacity to retain information over long sequences of characters or words.

    [end of text]

    llama_print_timings: load time = 100.15 ms
    llama_print_timings: sample time = 5.85 ms / 39 runs ( 0.15 ms per token, 6668.95 tokens per second)
    llama_print_timings: prompt eval time = 140.10 ms / 17 tokens ( 8.24 ms per token, 121.34 tokens per second)
    llama_print_timings: eval time = 1926.63 ms / 38 runs ( 50.70 ms per token, 19.72 tokens per second)
    llama_print_timings: total time = 2084.10 ms
    Log end

    However, the pricing doesn't change; it's still INR 6600/month.

  • vsys_host Member, Patron Provider

    We've got GPU dedicated server plans! Take a look o:)

    Also, here you can find GPU dedicated servers on sale:
    https://vsys.host/gpu-servers-sale

    There is an option to inform us about the GPU dedicated server setup or solution you need, and we'll customize it specifically for you. Crafting personalized solutions is our specialty and what we find most fulfilling! <3

    • Capable of mining, video encoding/decoding, data science tasks, and more
    • Tailored configurations for GPU servers, single or dual GPUs
    • Reliable and robust hardware for high-performance computing
    • Accepting Crypto payments for GPU server hosting
    • Enjoy 1 Gbps unlimited bandwidth
    • Rapid deployment in less than 48 hours

    PRO GPU
    CPU: E5-2670V3 (12×2.3GHZ)
    GPU: GeForce GTX 1080 Ti
    RAM: 64GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $299/ month
    DISCOUNTED PRICE: $200 / month
    CONFIGURE AND BUY SERVER

    INCREDIBLE GPU
    CPU: E5-2670V3 (12×2.3GHZ)
    GPU: 2 x GeForce GTX 1080 Ti
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $399 / month
    DISCOUNTED PRICE: $300 / month
    CONFIGURE AND BUY SERVER

    SUPERIOR GPU
    CPU: Dual E5-2670V3 (24×2.3GHZ)
    GPU: 2 x GeForce RTX 3080
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $499 / month
    DISCOUNTED PRICE: $350 / month
    CONFIGURE AND BUY SERVER

    INCREDIBLE GPU +
    CPU: Dual E5-2670V3 (24×2.3GHZ)
    GPU: 2 x GeForce RTX 3080 Ti
    RAM: 128GB DDR4
    DRIVE: 250GB SSD
    PORT: 1Gbps
    IPV4/IPV6: /32 /64
    ROOT ACCESS: SSH, Panel
    IPMI: v.2
    FREE SETUP: ✓
    24×7 SUPPORT: ✓
    OLD PRICE: $599 / month
    DISCOUNTED PRICE: $400 / month
    CONFIGURE AND BUY SERVER

  • bh4tech Member
    edited December 2023

    @sreekanth850 is not ready to take my server with 16GB VRAM at ~$80/month, so I don't think he will take yours with 11GB VRAM at $200/month. I understand that your GPU will be faster, but even then you need enough VRAM to keep the model in memory while running (especially when demand grows and you need to run 2-3 LLMs in parallel).

    I guess he is starting small and will stick with API providers.

  • @bh4tech said:
    @sreekanth850 is not ready to take my server with 16GB VRAM at ~$80/month, so I don't think he will take yours with 11GB VRAM at $200/month. I understand that your GPU will be faster, but even then you need enough VRAM to keep the model in memory while running (especially when demand grows and you need to run 2-3 LLMs in parallel).

    I guess he is starting small and will stick with API providers.

    Yes, that too. But I will keep this in mind; pay-as-you-go will be viable for prototyping.

  • crunchbits Member, Patron Provider, Top Host

    @sreekanth850 said:

    @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

    Thanks @sh97

    DeepInfra is well priced for an end-user turnkey solution. I like their approach. We're building some ISOs now for "one-click deployment" of popular open-source LLMs, as I didn't realize how 'tricky' some of them are to get going, especially if you're relatively new to them. We don't have as much time as a team to mess around with them 'outside of work' as we used to, so it's always useful feedback.
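
    As an illustration of what "one-click" can boil down to, here is a minimal sketch of a first-boot script using the open-source Ollama runtime and a quantized Mistral model; this is just one possible approach, not necessarily what we are building.

    # Sketch of a first-boot provisioning script for a "one-click LLM" image (Ollama is one open-source option)
    curl -fsSL https://ollama.com/install.sh | sh   # installs the runtime and starts it as a service
    ollama pull mistral                             # pre-fetch a quantized Mistral model (roughly 4 GB)
    # The local HTTP API then listens on localhost:11434 for chat/completion requests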

    What LLMs would you/anyone be interested in being able to quickly deploy?

  • About DeepInfra, feel free to take advantage of the F6S deal for a 150h free test
    https://www.f6s.com/company-deals/deepinfra/150h-free-ai-ml-models-by-api-14180

  • @crunchbits said:

    @sreekanth850 said:

    @sh97 said:
    @crunchbits already offers GPU servers, though they are almost always out of stock. They are also looking into hourly billing.

    Providing GPU servers is okay, but anything on top of that, like a ready-to-deploy LLM, would be super useful for many startups.

    Thanks @sh97

    DeepInfra is well priced for an end-user turnkey solution. I like their approach. We're building some ISOs now for "one-click deployment" of popular open-source LLMs, as I didn't realize how 'tricky' some of them are to get going, especially if you're relatively new to them. We don't have as much time as a team to mess around with them 'outside of work' as we used to, so it's always useful feedback.

    What LLMs would you/anyone be interested in being able to quickly deploy?

    Mistral, Llama, and Falcon are good to go with, imo.

  • PUSHR_Victor Member, Host Rep

    @sreekanth850 said:
    Mistral, Llama, and Falcon are good to go with, imo.

    Falcon will probably end up being expensive: https://huggingface.co/spaces/tiiuae/falcon-180b-license/blob/4b5cfaa8bc5c8af982fb545c1f832e3541683aef/LICENSE.txt#L62

    Not sure about the other two.

  • @PUSHR_Victor said:

    @sreekanth850 said:
    Mistral, Llama, and Falcon are good to go with, imo.

    Falcon will probably end up being expensive: https://huggingface.co/spaces/tiiuae/falcon-180b-license/blob/4b5cfaa8bc5c8af982fb545c1f832e3541683aef/LICENSE.txt#L62

    Not sure about the other 2

    I thought it was under the Apache 2.0 license.

  • Mistral

  • sreekanth850 Member
    edited January 18

    Finally, I settled on DeepInfra and Together AI, both of which provide more than 50 open-source models. DeepInfra has served my purpose well, and Together AI is kept as a backup.
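
    For anyone setting up a similar primary/backup split: both providers speak an OpenAI-style chat API, so switching to the backup is mostly a matter of changing the base URL and key. A sketch of the Together AI side is below; the endpoint path and model ID are assumptions to verify in their docs.

    # Sketch: the Together AI backup call, same OpenAI-style payload as the DeepInfra one above
    # (endpoint path and model ID are assumptions; verify against Together's docs)
    curl https://api.together.xyz/v1/chat/completions \
        -H "Authorization: Bearer $TOGETHER_API_KEY" \
        -H "Content-Type: application/json" \
        -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1",
             "messages": [{"role": "user", "content": "Hello"}]}'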
