article thumbnail

Watch Meta’s engineers discuss optimizing large-scale networks

Engineering at Meta

Also, the pivot to metaverse has led to a significant increase in AI, HPC, and machine learning workloads that demand huge networking bandwidth and compute capacity and pose challenges around safe co-existence of existing web, legacy and modern workloads.

article thumbnail

Meta Andromeda: Supercharging Advantage+ automation with the next-gen personalized ads retrieval engine

Engineering at Meta

Unlocking advertiser value through industry-leading ML innovation Meta Andromeda is a personalized ads retrieval engine that leverages the NVIDIA Grace Hopper Superchip, to enable cutting edge ML innovation in the Ads retrieval stage to drive efficiency and advertiser performance.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Logarithm: A logging engine for AI training workflows and services

Engineering at Meta

This powers live training model monitoring dashboards such as an internal deployment of TensorBoard, and is used by ML engineers to debug model convergence issues and training failures (due to gradient/loss explosions) using notebooks on raw telemetry. learning rate), model internal state tensors (e.g., Multimodal data (e.g.,

article thumbnail

Meta’s open AI hardware vision

Engineering at Meta

Networking and bandwidth play an important role in ensuring the clusters’ performance. Our systems consist of a tightly integrated HPC compute system and an isolated high-bandwidth compute network that connects all our GPUs and domain-specific accelerators. Building AI clusters requires more than just GPUs.

Bandwidth 131
article thumbnail

High packet loss when handling multiple high-bandwidth UDP streams due to router MTU fragmentation [closed]

Network Engineering

The sensors and the WiFi chip are able to handle the bandwidth and throughput, I'm not observing any dropped packets from the sender side. What would be the most efficient and effective solution to handle these multiple high-bandwidth UDP streams while minimizing packet loss? I'm currently using a TP Link AXE5400.

Routers 52
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure that there is enough ingress bandwidth on the rack switch to not hinder the training workload. The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location.

article thumbnail

Grow your leadership impact as a Tech Lead or Engineering Manager

Asana

We’ve reached a new chapter in our R&D organization – hiring and onboarding more engineering leaders than ever before. All development teams at Asana have a single Engineering Manager (EM) , who is accountable (1) for program success. This guidance is complementary to our engineering career levels guide (our “Success Guide”).