A media server in a WebRTC infrastructure plays a critical role in scaling a WebRTC call beyond 4 participants. Whenever you join a call that has 8-10 participants or more, know that a media server is doing the hard work behind the scene to provide you with a smooth audio/video experience. If you have a need for building a WebRTC infrastructure and you need to select a WebRTC media server for your use case, then this post is going to help you with enough information to take an informed decision.

Why and When a WebRTC Media Server is required?

A WebRTC Media Server is a critical piece of software that helps a WebRTC application distribute audio/video streams to all the participants of an audio/video meeting. Without them, creating a large audio/video call beyond 4 users would be a highly difficult task due to the nature of WebRTC calls. WebRTC calls are designed for real-time use cases (<1 second of delay between the sender and receiver of an audio/video stream). In this case, a user sending his/ her audio/video streams has to send the streams to all the participants who are joining the conference for viewing it in real-time, so that a real conversation can happen. Imagine a call with 10 people, where everybody is sending his / her audio/video stream to rest 9 people(other than himself/herself) so that they can view it in real time. Let's do some maths to find out some interesting details.

When a user joins an audio-video call that is running on WebRTC, he/she can share either audio/video/screen or all of them together.

If joined only with audio: ~40Kbps of upload bandwidth is consumed

if joined with only video: ~ 500Kbps of upload bandwidth is consumed

if joined with only screen share: ~ 800 Kbps of upload bandwidth is consumed

if all 3 are shared together : ~1340Kbps or 1.3Mbps of upload bandwidth is consumed

If there are 10 people in the meeting, then 1.3 * 9 = 11.7 Mbps of upload bandwidth will be consumed every second! Remember that you need to send your audio/video/screen-share or all of them together to everybody else except yourself. Anybody who doesn't have a consistent 11.7Mbps bandwidth, can't join this meeting!

This also brings another challenge for the device being used by the user to join the conference. The CPU of the device has to work very hard to compress and encode the audio/video/screen share video streams to send over the network as data packets. If the CPU has to spend 5% of its capacity to compress and encode the users audio/video/screen-share streams to send it to another user who has joined the meeting, then it has to spend 9 * 5 = 45% of its efforts to compress, encode, and send the user's audio/video/screen-share streams to rest 9 participants.

Is the CPU not wasting its efforts by trying to do the exact same thing 9 times in this case?

Can we not compress, encode, and send just the user's audio/video/screen-share streams

once to the cloud and the cloud does some magic to replicate the audio/video/screen-share streams of that user and send it to everybody else present in the same meeting room!

Yes we possibly can do this magic and the name of this magic is Media Server!

Different kinds of WebRTC Media Servers, MCU vs. SFU

Primarily there are 2 kinds of Media servers. One is a SFU and another is a MCU.

According to the last example, now we know that we need a media server that can replicate and distribute the streams of a user to as many people as needed without wasting the user's network and CPU capacity. Let's take this example forward.

There is a situation, where the meeting needs to support various UI layouts with a good amount of configuration options regarding who can view and listen to whom! It turns out that this is going to be a virtual event with various UI layouts like Stage, backstage, front-row seats, etc. Here the job of the media server is to replicate and distribute the streams to everybody else except the user himself/herself. Therefore in this case of a 10-user virtual event, every user will be sending only his / her streams to the media server once and receiving the streams from everybody else as individual streams. This way, the event organizer can create multiple UI layouts for viewing by different users according to the place they currently are in, i.e. the backstage/ stage / front row. In this situation, the SFU is helping us by sending all the streams as individual audio/video streams without forcing the way they should be displayed to an individual user. In an SFU, though the user sends only his/her audio/video/screen-share streams it receives from everybody else as individual streams which consumes download bandwidth based on the number of participants. the more the number of participants, the more the download bandwidth is consumed!

Now let's take a different situation of a team meeting of 10 users of an organization who don't need much dynamism in the UI but are happy with the usual Grid layout of videos. In this situation, we can merge the audio and video streams of all other participants except himself/herself in the server and create one audio/video stream which can then be sent to all other participants. Here, all the users will send their own audio/video stream and receive all others' combined audio/video stream(Only one stream!) in a fixed layout as created by the server. The UI will just show one video which was sent by the server as the combined video element. Here MCU is helping us do our job neatly. In this situation, the download bandwidth consumption will be consistent irrespective of the number of users joining the meeting as every user will receive only one audio/video stream from the server. The 2 major downside of this approach is the number of servers needed to create a combined video of all users would be much higher than just replicating and sending the approach of an SFU and rigid UI layout which is already decided by the server without the UI having any control over it.

Two of the largest global video conferencing services use one of the approaches described above.

Gmeet : SFU

MS Teams: MCU

SFUs are slowly gaining more popularity due to the amount of flexibility they provide in creating UI layouts which is highly important for an engaging user experience and takes much lesser servers to cater to a large number of users as compared to an MCU. We are going to discuss the most popular SFUs available out there today and how to choose one for your next WebRTC Media Server requirement.

How to Choose a WebRTC Media Server for your next requirement?

In this section, we are going to discuss the top open-source media servers currently available out there and how they perform against each other. Here, I am going to discuss those media servers which use WebRTC/ openRTC as their core implementation. I won't be covering the media servers built on PION, the go implementation of WebRTC as that needs a different post.

We would be discussing some of the key things about the below media servers.

  1. Jitsi Video Bridge(JVB), Jitsi (SFU)
  2. Kurento (SFU + MCU)
  3. Janus (SFU)
  4. Medooze (SFU + MCU)
  5. Mediasoup(SFU)

We would primarily be discussing the performance of each media server along with its suitability for building a WebRTC infrastructure.

Jitsi Video Bridge(JVB), Jitsi

Jitsi is a very popular open-source video conferencing solution available out there today. It is so popular because it provides a complete package for building a video conferencing solution including a web & mobile UI, the media server component which is JVB along with some required add-ons like recording and horizontal scalability out of the box. It has very good documentation as well which makes it easy to configure it on a cloud like AWS.

Kurento

Kurento used to be the de facto standard for building WebRTC apps for the promises it made to the WebRTC developers with its versatility(SFU + MCU) and OpenCV integration for real-time video processing way back in 2014. But after the acquisition of Kurento and its team by Twillio in 2017, the development has stopped and now it's in maintenance mode. One can understand that it is not so great now from the fact that the current team which is maintaining Kurento has a freemium offering named OpenVidu which uses mediasoup as its core media server!

Janus

Janus is one of the most performant SFUs available out there with very good documentation. It has a very good architecture where the Janus core does the job of routing and allows various modules to do various jobs including recording, bridging to SIP/PSTN, etc. It is being updated regularly by its backer to keep it up-to-date with the latest WebRTC changes. This can be a choice for building a large-scale Enterprise RTC application which needs a good amount of time and resource investment for building the solution. The reason is that it has its own way of architecting the application and can't be integrated as a module into a large application like mediasoup.

Medooze

Medooze is more known for its MCU capabilities than SFU capabilities though its SFU is also a capable one. Though it is a performant media server, it lacks in the documentation side which is key for open source adoption. It was acquired by Cosmo Software in 2020 after which Cosmo Software has been acquired by Dolby. This can be your choice if you are a pro in WebRTC and know most of the stuff by yourself. From Github commits it seems that it is still in active development but it still needs good effort in the documentation side.

Mediasoup

Mediasoup is a highly performant SFU media server available today with detailed documentation and it is backed by a team of dedicated authors with a vibrant open source community and backers. the best part is that it can be integrated into a large Nodejs / Rust application as a module to let it do its job as part of a large application. It has a super low-level API structure which enables developers to use whatever/however they need to use it inside their application. Though it needs a good amount of understanding to build a production-ready application that is beyond the demo provided by the original authors, it is not that difficult to work with it if one is passionate and dedicated to learning the details.

Below is a set of exhaustive performance benchmarking tests done by Cosmo Software people back in 2020 at the height of COVID when WebRTC usage was going beyond the roof to keep the world running remotely. Below are the important points from the test report that are needed to be considered. The whole test report can be found at the bottom of this post for people interested to know more.

Testing a WebRTC application needs to be done with virtual users which actually are cloud VMs joining a meeting room as a test user performing a certain task/tasks. In this case, the test users aka cloud VMs joined using the below-mentioned configuration. In this case, all the above servers were hosted as a single instance server using a VM as described below.

The next is load parameters which were used to test each of these media servers. The numbers are not the same for all these media servers as the peak load (after which a media server fails!) capacity is not the same for every one of these. Here these peak load numbers of each media server have been derived after a good amount of DRY runs.

The test result of the load test.

  • Page loaded: true if the page can load on the client side which is running on a cloud VM.
  • Sender video check: true if the video of the sender is displayed and is not a still or blank image.
  • All video check: true if all the videos received by the six clients from the SFU passed the video check which means every virtual client video can viewed by all other virtual clients.

There are other important aspects of these media servers like RTT(Round Trip Time), Bitrates and overall video quality.

The Bitrate is directly responsible for video quality. It simply means how many media stream data packets are being transmitted in real time. the higher the bitrate the better is the image quality but the higher the load on the network to transmit and on the client side CPU to decode. Therefore, it is always a balancing act tp trying to send as many data packets aka the bitrate as possible without congesting the network or overburdening the CPU. Here a good media server can play a good role with techniques like Simulcast / SVC to perosnalise the bitrate for each individual receiver based on their network and CPU capacity.

The RTT is an important parameter which tells that how fast a a media stream data aka RTP packet is delivered over the real time network conditions. The lower the RTT the better it is.

As it tells, this is the video quality being transmitted by the media server in various load patterns. The higher the quality the better it is.

I hope I was able to provide a brief description of each media server with a enough data points so that you can make a good decision in choosing the media server for your next video project. Feel free to drop me an email at sp@centedge.io if you need any help with your selection process or with video infrastructure development process. We have a ready to use cloud video infrastructure built with mediasoup media server which can take care of your scalable video infra needs and let you focus on your application and business logic. You can have an instant video call/ scheduled video call with me using this link for discussing anything related to WebRTC/media servers/ video conferencing/live streaming etc.

PS: Here is the link to the full test report if anybody is interested in reading the whole of it which has a detailed description of this load test along with many interesting findings.