How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms.

A RoCE network for distributed AI training at scale

Engineering at Meta

This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.
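
To make the encapsulation idea concrete, here is a minimal sketch of wrapping an opaque RDMA payload in a UDP header. It assumes the standard IANA-assigned RoCEv2 destination port (4791). The helper names and placeholder payload are hypothetical, and the sketch leaves out the IP/Ethernet layers and the real InfiniBand transport headers.

```python
import struct

ROCEV2_UDP_DST_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def udp_encapsulate(payload: bytes, src_port: int) -> bytes:
    """Prefix payload with an 8-byte UDP header.

    The checksum field is left at 0 (meaning "not computed" over IPv4)
    to keep the sketch short.
    """
    length = 8 + len(payload)  # UDP length covers header plus payload
    header = struct.pack("!HHHH", src_port, ROCEV2_UDP_DST_PORT, length, 0)
    return header + payload


def udp_decapsulate(datagram: bytes) -> tuple[int, int, bytes]:
    """Return (source port, destination port, payload) from a UDP datagram."""
    src, dst, length, _checksum = struct.unpack("!HHHH", datagram[:8])
    return src, dst, datagram[8:length]


# Placeholder bytes standing in for the RDMA transport headers and data.
rdma_payload = b"\x00" * 12 + b"hello"
datagram = udp_encapsulate(rdma_payload, src_port=49152)
```

Because the RDMA traffic rides inside ordinary UDP/IP, it can be routed across switches, which the non-routable RoCEv1 could not do.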

Trending Sources

Journey to Event Driven – Part 2: Programming Models for the Event-Driven Architecture

Confluent

The broker sizing is a function of the total sum of partitions across the number of servers and replicas. Consumers in this context are anything that requests data; they could be stream processors, Java or .NET applications or KSQL server nodes. In some cases, it is desirable to hide these protocol concerns.
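
The sizing relationship in this excerpt can be sketched as simple arithmetic. This is a rough illustration, not the article's own formula: the function names and the per-broker replica limit are hypothetical, and real sizing also depends on throughput, storage, and retention.

```python
import math


def replicas_per_broker(total_partitions: int, replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas each broker hosts."""
    return total_partitions * replication_factor / brokers


def brokers_needed(total_partitions: int, replication_factor: int,
                   max_replicas_per_broker: int) -> int:
    """Smallest broker count that keeps each broker under the assumed limit.

    A cluster also needs at least as many brokers as the replication factor,
    since each replica of a partition lives on a different broker.
    """
    by_load = math.ceil(total_partitions * replication_factor / max_replicas_per_broker)
    return max(by_load, replication_factor)


average_load = replicas_per_broker(total_partitions=1200, replication_factor=3, brokers=6)
minimum_brokers = brokers_needed(total_partitions=1200, replication_factor=3,
                                 max_replicas_per_broker=1000)
```

With these example numbers, six brokers would each carry 600 replicas on average, and a limit of 1,000 replicas per broker would call for at least four brokers.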

Using Streams Replication Manager Prefixless Replication for Kafka Topic Aggregation

Cloudera Blog

It contains the name (alias), address (bootstrap servers), and credentials that SRM can use to access a specific cluster. The setup in this tutorial is minimal and unsecure, so you only need to configure Name, Bootstrap Servers, and Security Protocol lines. The security protocol in this case is PLAINTEXT.
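
As an illustration of the three fields the excerpt names, the sketch below models a cluster entry and maps it to the standard Kafka client property keys `bootstrap.servers` and `security.protocol`. The `ClusterEntry` class, the alias, and the host names are hypothetical and are not SRM's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass
class ClusterEntry:
    """Hypothetical stand-in for one cluster definition: alias, address, protocol."""
    name: str
    bootstrap_servers: str
    security_protocol: str = "PLAINTEXT"  # unsecured, as in the minimal setup described

    def client_properties(self) -> dict[str, str]:
        """Translate the entry into standard Kafka client configuration keys."""
        return {
            "bootstrap.servers": self.bootstrap_servers,
            "security.protocol": self.security_protocol,
        }


# Example host names are placeholders, not values from the tutorial.
entry = ClusterEntry(name="primary", bootstrap_servers="broker1.example.com:9092")
properties = entry.client_properties()
```

A secured cluster would need a different `security_protocol` value plus credentials, which this sketch omits.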

NetOps for Application Developers: Understanding the Importance of Network Operations in Modern Development

Kentik

Having an expert perspective on network protocols helps ensure data will be moved securely and with network performance in mind. Cross-functional teams: One way to ensure architectural decisions include the perspective of both application and network specialists is to create cross-functional teams.

Journey to Event Driven – Part 3: The Affinity Between Events, Streams and Serverless

Confluent

Stream processors allow us to work natively with these streams in a “correct” fashion by supporting a myriad of patterns with either bespoke logic, such as Kafka Streams, or a higher order grammar: KSQL. VM, container, server. Stream patterns (fan out, fan in/join). Consumption based. FaaS provider. Runtime limits. 3,008 MB memory.

The power of columnar databases for telemetry

SysAdmin1138 Explains

What this means for the future of telemetry and observability: Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. NFS shares for Syslog.