Do you have many LLM calls in your data transformation flow? CocoIndex can help. Built on an ultra-performant Rust engine, CocoIndex now offers out-of-the-box adaptive batching. This has improved throughput by up to 5x (an ~80% reduction in processing time) on AI workloads. Best of all, you don't need to change your code: batching happens automatically, adapting to your traffic and keeping GPUs fully utilized. In this post we walk through the new adaptive batching support in CocoIndex. But first, let's address a few questions you may have in mind.

## Why does batching speed things up?

Every call to a model or API does two kinds of work:

- **Fixed overhead per call:** This consists of all the preparatory and administrative work required before the actual computation can begin. Examples include GPU kernel launch setup, Python-to-C/C++ transitions, scheduling of tasks, memory allocation and management, and bookkeeping performed by the framework. These overhead tasks are largely independent of the input size but must be paid in full for each call.
- **Data-dependent work:** This portion of the computation scales directly with the size and complexity of the input. It includes floating-point operations (FLOPs) performed by the model, data movement across memory hierarchies, token processing, and other input-specific operations. Unlike the fixed overhead, this cost increases proportionally with the volume of data being processed.

When items are processed one at a time, the fixed overhead is paid again for every item, and it can quickly dominate total runtime, especially when the per-item computation is small. Processing many items in a single call, by contrast, dramatically reduces the per-item share of that overhead.
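To make the trade-off concrete, here is a minimal back-of-the-envelope model (illustrative constants, not measurements of any real system) that treats total runtime as fixed overhead per call plus data-dependent work per item:

```python
import math

def total_time_ms(n_items: int, batch_size: int,
                  overhead_ms: float = 5.0, per_item_ms: float = 1.0) -> float:
    """Model runtime as (calls * fixed overhead) + (items * data-dependent work)."""
    n_calls = math.ceil(n_items / batch_size)
    return n_calls * overhead_ms + n_items * per_item_ms

# Unbatched: 1000 calls, each paying the 5 ms fixed overhead.
unbatched = total_time_ms(1000, batch_size=1)    # 1000*5 + 1000*1 = 6000 ms
# Batched by 128: the overhead is paid only 8 times.
batched = total_time_ms(1000, batch_size=128)    # 8*5 + 1000*1 = 1040 ms
```

With these illustrative numbers, batching cuts total time by nearly 6x even though the data-dependent work is unchanged; only the fixed overhead shrinks.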
Batching lets the fixed overhead be amortized across many items, and it helps both hardware and software execute the data-dependent work more efficiently. Batching delivers several compounding benefits:

- **Amortizing one-time overhead:** Each function or API call carries a fixed overhead: GPU kernel launches, Python-to-C/C++ transitions, task scheduling, memory management, and framework bookkeeping. By processing items in batches, this overhead is spread across many inputs, dramatically reducing the per-item cost and eliminating repeated setup work.
- **Maximizing GPU efficiency:** Larger batches allow the GPU to execute operations as dense, highly parallel matrix multiplications, commonly implemented as General Matrix-Matrix Multiplication (GEMM). This mapping ensures the hardware runs at higher utilization, fully leveraging parallel compute units, minimizing idle cycles, and achieving peak throughput. Small, unbatched operations leave much of the GPU underutilized, wasting expensive computational capacity.
- **Reducing data transfer overhead:** Batching minimizes the frequency of memory transfers between CPU (host) and GPU (device). Fewer Host-to-Device (H2D) and Device-to-Host (D2H) operations mean less time spent moving data and more time devoted to actual computation. This is critical for high-throughput systems, where memory bandwidth often becomes the limiting factor rather than raw compute power.

In short, batching turns many small, inefficient operations into fewer, larger ones that use modern hardware to its full capacity.
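A toy mock (hypothetical class, purely illustrative) shows how batching changes the call pattern: the fixed per-call cost, counted here as kernel launches, collapses to one while the data-dependent work stays the same:

```python
class ToyModel:
    """Stand-in for a GPU-backed encoder that counts fixed-overhead events."""
    def __init__(self) -> None:
        self.kernel_launches = 0   # fixed overhead: paid once per call
        self.items_processed = 0   # data-dependent work: scales with input

    def encode(self, texts: list[str]) -> list[list[float]]:
        self.kernel_launches += 1           # one launch per call, any batch size
        self.items_processed += len(texts)  # work grows with the batch
        return [[float(len(t))] for t in texts]

texts = [f"chunk {i}" for i in range(1000)]

one_by_one = ToyModel()
for t in texts:
    one_by_one.encode([t])   # 1000 calls -> 1000 launches

in_batch = ToyModel()
in_batch.encode(texts)       # 1 call -> 1 launch, same 1000 items
```

Both variants do identical data-dependent work; only the number of times the fixed overhead is paid differs.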
In AI systems, including large language models, computer vision, and real-time data processing, batching is not just an optimization; it is essential for achieving production-grade performance.

## What batching looks like in Python code

### Non-batched code: simple but inefficient

The most natural way to write a pipeline is to process data piece by piece, for example with a two-level loop like this:

```python
for file in os.listdir(directory):
    content = file.read()
    chunks = split_into_chunks(content)
    for chunk in chunks:
        vector = model.encode([chunk.text])  # one item at a time
        index.upsert(file_id=file.name, chunk_offset=chunk.offset, vector=vector)
```

This is easy to read and reason about: every piece of data flows directly through the steps.

### Manual batching: faster but clunky

You can speed this up with batching, but even the simple "just batch everything once" version complicates the code considerably:

```python
# 1) Collect payloads and remember where each came from
batch_texts = []
metadata = []  # (file_id, chunk_offset)

for file in os.listdir(directory):
    content = file.read()
    chunks = split_into_chunks(content)
    for chunk in chunks:
        batch_texts.append(chunk.text)
        metadata.append((file.name, chunk.offset))

# 2) One batched call (library will still mini-batch internally)
vectors = model.encode(batch_texts)

# 3) Zip results back to their sources
for (file_name, chunk_offset), vector in zip(metadata, vectors):
    index.upsert(file_id=file_name, chunk_offset=chunk_offset, vector=vector)
```

Worse, batching everything in one shot delays the pipeline, because downstream steps can only start after the entire batch has been processed.

### Batching support in CocoIndex

CocoIndex gives you the best of both worlds: you keep the simplicity of code that follows the natural flow, while gaining the efficiency of batching performed by the CocoIndex runtime.
Batching is supported out of the box for CocoIndex's built-in embedding functions, such as SentenceTransformerEmbed. You don't need to change a thing: your existing code will just work, still following the natural flow, while enjoying the efficiency of batching.

For custom functions, enabling batching is as simple as:

1. Set `batching=True` in the custom function decorator.
2. Change the argument and return types to lists.

For example, if you want to create a custom function that calls an API to generate image thumbnails:

```python
@cocoindex.op.function(batching=True)
def make_image_thumbnail(self, args: list[bytes]) -> list[bytes]:
    ...
```

See the batching documentation for more details.

## How CocoIndex batches: the mechanism

Batching works by collecting incoming requests into an internal queue and deciding the right moment to flush the queue as a single batch. That timing is critical: get it right, and you optimize throughput, latency, and resource usage all at once. Two widely used batching policies dominate the landscape:

- **Time-based batching (flush every W milliseconds):** The system flushes all requests that arrived within a fixed window of W milliseconds.
  - *Advantages:* The maximum wait time for any request is predictable, and implementation is straightforward. It ensures that even during low traffic, requests will not remain in the queue indefinitely.
  - *Drawbacks:* During periods of sparse traffic, idle requests accumulate slowly, adding latency for early arrivals. Additionally, the optimal window W often varies with workload characteristics, requiring careful tuning to strike the right balance between latency and throughput.
- **Size-based batching (flush when K items are queued):** A batch is triggered once the queue reaches a pre-defined number of items, K.
  - *Advantages:* The batch size is predictable, which simplifies memory management and system design. It is easy to reason about the resources each batch will consume.
  - *Drawbacks:* When traffic is light, requests may remain in the queue for an extended period, increasing latency for the first-arriving items. Like time-based batching, the optimal K depends on workload patterns, requiring empirical tuning.

Many high-performance systems adopt a hybrid approach: flush a batch when either the time window W expires or the queue reaches K items, whichever happens first. This captures the advantages of both methods, improving responsiveness during sparse traffic while keeping batch sizes bounded under peak load.

Either way, batching introduces tunable parameters and trade-offs. Traffic patterns, workload characteristics, and system constraints all shape the right settings, and reaching an ideal configuration typically requires monitoring, profiling, and repeated parameter adjustment to match real-world conditions.

## The CocoIndex approach: adaptive, knob-free

CocoIndex uses a simple and natural batching mechanism that adapts automatically to the request load. The process works as follows:

- **Continuous queuing:** While the current batch is being processed on the device (e.g., GPU), any new incoming requests are not immediately processed. Instead, they are queued. This allows the system to accumulate work without interrupting the ongoing computation.
- **Automatic batch window:** When the current batch finishes, CocoIndex automatically collects all requests queued in the meantime and treats them as the next batch. This set of requests forms the new batch window, and the system starts processing it immediately.
- **Adaptive batching:** There are no timers, no fixed batch sizes, and no preconfigured thresholds.
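The queueing behavior can be sketched as a tiny discrete-time simulation (a sketch of the policy, not CocoIndex's actual Rust runtime): while a batch is in flight, arrivals queue up, and the moment it completes, everything queued becomes the next batch:

```python
def adaptive_batch_sizes(arrivals: list[int], service_steps: int) -> list[int]:
    """arrivals[t] = number of requests arriving at time step t;
    service_steps = how long one batch occupies the device.
    Returns the size of each batch that gets flushed."""
    queued, busy, sizes = 0, 0, []
    for n in arrivals:
        if busy:
            busy -= 1             # in-flight batch makes progress
        queued += n               # new requests wait in the queue
        if busy == 0 and queued:
            sizes.append(queued)  # flush everything accumulated so far
            queued = 0
            busy = service_steps
    return sizes

# Steady traffic (1 request/step, batch takes 3 steps): batches settle at size 3.
print(adaptive_batch_sizes([1] * 9, service_steps=3))                   # [1, 3, 3]
# Sparse traffic: batches stay at size 1, so latency is near single-call.
print(adaptive_batch_sizes([1, 0, 0, 0, 1, 0, 0, 0], service_steps=3))  # [1, 1]
```

Note that no W or K appears anywhere: the batch size emerges from the traffic rate and the service time.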
The batch size adapts naturally to the traffic that arrived during the previous batch's service time. Heavy traffic automatically produces larger batches, increasing GPU utilization; light traffic produces smaller batches, reducing latency for waiting requests. In essence, CocoIndex's batching mechanism is self-tuning: it automatically groups requests into batches and adjusts the batch size to match demand in real time, achieving high throughput without manual tuning or rigid heuristics.

Why is this nice?

- **Low latency when traffic is scarce:** With few requests, batches stay small (often just 1), so latency stays close to that of a single call.
- **High throughput when busy:** When traffic spikes, many requests accumulate during the in-flight batch, so the next batch is large, and utilization rises automatically.
- **No tuning:** There is no W or K to configure. The system adapts to your traffic pattern by construction.

## Function-level batching: packing the batch intelligently

At the function level, CocoIndex hands each function a batch window, i.e. all the requests that queued up during the previous batch, promptly and without delay. The framework delivers the batch, but how it is processed is up to the function, which allows for flexibility and optimal performance.

Take SentenceTransformerEmbed as an example. The underlying sentence-transformers library can accept batches of arbitrary length, but internally it splits them into micro-batches (default size: 32) to ensure each chunk fits comfortably in device memory and keeps GPU kernels in their optimal "sweet spot."
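As a rough sketch (a hypothetical helper, not the library's actual code), splitting one batch window into fixed-size micro-batches looks like this:

```python
def micro_batches(window: list[str], size: int = 32) -> list[list[str]]:
    """Split a batch window into device-friendly micro-batches
    (sentence-transformers uses a default batch size of 32)."""
    return [window[i:i + size] for i in range(0, len(window), size)]

window = [f"sentence {i}" for i in range(100)]
print([len(mb) for mb in micro_batches(window)])  # [32, 32, 32, 4]
```

The function can then run the model once per micro-batch, keeping each chunk within device memory regardless of how large the batch window grows.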
Batching isn't only about fitting data into memory; it is also about minimizing wasted compute. Transformer runtimes typically pad every sequence in a batch to the length of the longest sequence, which lets the GPU run a single, high-throughput kernel. The flip side is that short sequences pay the cost of the longest sequence in the batch: mixing 64-token and 256-token sentences means the 64-token ones incur roughly 4x more compute than necessary. CocoIndex mitigates this by sorting requests by token count and forming micro-batches of similar-length sequences, reducing padding overhead while keeping GPU utilization high.

Other functions are free to apply their own custom packing schemes, such as SIMD-friendly layouts or merged writes. CocoIndex stays agnostic about the "how": its job is to deliver the batch window efficiently and without delay, giving every function full control over how to maximize throughput and minimize overhead.

### A design equilibrium

The framework takes care of orchestrating batching, while functions take care of memory, compute, and kernel efficiency. This split delivers high throughput across diverse workloads without imposing a one-size-fits-all solution, preserving simplicity, flexibility, and performance.

## Conclusion

Batching is one of the most effective strategies for accelerating computational workloads. By amortizing fixed overhead across multiple items, enabling larger and more efficient GPU operations, and minimizing data transfer, batching turns many small, inefficient operations into fewer large, highly efficient ones.
CocoIndex makes batching effortless and automatic. Built-in functions batch under the hood, and custom functions get batching with a simple `batching=True` in the decorator. This removes the complexity of manually managing queues, timers, or batch sizes, letting developers focus on their models and applications.

The benefits of batching are most pronounced when fixed overhead represents a significant portion of the total computation, as with small models or lightweight operations. Batching helps most when the underlying API or library fully supports batched operations; otherwise the gains can be limited. For example, some backends such as Ollama show only modest improvements under batching.

In short, batching is a high-leverage optimization: it boosts throughput, reduces latency where it matters, and lets the hardware operate at its full potential, all while preserving a natural developer experience. CocoIndex handles the complexity and delivers the benefits of batching automatically across your workloads.

Support us by giving CocoIndex a ⭐ Star on GitHub, and share it with your community if you find it helpful!