
Disruptive Trends in Multimedia Localization, Speech-to-Text, and Video Streaming

September 9, 5:39 PM

YouTube is the second most popular website on the internet, and according to HubSpot, 3 in every 4 millennials watch at least one video per day. Video has never been more popular. This rise in demand and consumption presents a major challenge for video localisation. The latest trends suggest AI and machine learning (ML) will take over the localisation industry, and the ever-increasing interest in speech-to-text and live captioning supports this. But is it for the greater good? Will it replace human translation or limit the need for human translators? This presentation covers how sophisticated AI tools have become, and how advanced they will get in the future. Watch top multimedia localisation insights from the latest of our industry events.

Transcription

Max Morkovkin 00:05
And I think right now we are ready to announce our next speaker: Alex Chernenko, CEO at Translit. Alex is joining us now. So Alex will be talking about the latest trends in multimedia localization, and he will also announce the name of the book that he will be giving as a gift to the person with the best question. Hey, Alex. Hi, Max, can you hear me okay? Yeah, I can hear you. Okay. As far as I know, you are joining us from a nice Irish city, Limerick.

Alex Chernenko 00:48
That's correct. Well, today it's not the nicest; we got rain, as usual. How is the weather in Russia today?

Max Morkovkin 00:57
Russia is big, I don't know, but probably it's okay. So Alex, I think you are all prepared to present your screen and start sharing the best insights about multimedia localization. And we'll announce in a second what the book is that we will give as a gift.

Alex Chernenko 01:22
The book is The 7 Habits of Highly Effective People by Stephen Covey. It's one of the books I keep on my desk, and I highly recommend it to anyone who hasn't read it.

Max Morkovkin 01:33
Okay, I believe one of the habits is to listen to great interesting presentations. So the floor is yours. Take the mic, and let's start.

Alex Chernenko 01:42
Thank you for having me. So today we're going to talk about disruptive trends in multimedia localization, speech-to-text, and video streaming. To tell you a little bit about myself: I am the CEO and founder of Translit. We are an ISO-certified language company providing all the classical services, but our focus is on interpreting and voice services such as voiceover. We have two technologies: a remote interpreting delivery platform and a marketplace for interpreters. I'm a former localization engineer and tester myself, and I've been in the industry for the past 20 years. And I used to speak six languages, believe it or not; now it's only three. So let's look at what's been happening in the world. Before 2020, we enjoyed in-person meetings, conferences, live events. Then during 2020 everything went virtual and online: people had to learn to work remotely, and we had limited or no travel at all. What that did was spark the use of video services and video content, by people who were sitting at home in lockdowns and by different businesses. One of the digital media trends surveys by Deloitte showed that streaming video service use increased by 23%, and that data is actually a little bit out of date; the number is already higher. Popular video streaming services such as YouTube saw an increase of 45% just in the first six months of 2020. And surprisingly, you will see that mobile viewing of YouTube has actually decreased, while TV-connected viewing has increased to more than a third of consumption. People have been searching for various services, advice, and how-to keywords, and we see this increase in video consumption. If you look further at YouTube's statistics, they've been able to grow their premium subscribers from 18 million in 2019 to 30 million last year, and of course their revenue has grown. Now let's look at video consumption and how many people are actually using YouTube.
The latest statistics suggest that 74% of adults in the US use YouTube, and 80% of US-based children use YouTube. Those numbers would be slightly lower outside of the US, but they would still be close. Another popular video streaming service, Netflix, has seen the same increase in users, from 167 million in 2019 to over 200 million last year. And of course, the time spent watching Netflix also increased, to more than three hours per person. Other players such as Amazon Prime Video, Apple TV, and Disney have also seen this trend, and they started adding localization, because one of the ways to grow revenue and reach further audiences is, of course, to localize. YouTube is already localized in more than 100 countries and more than 80 languages. Other service providers such as Disney+ also added localization to grow their user base: they launched Disney+ Hotstar in India in April last year, and they're expected to grow to 60 to 90 million subscribers. The revenue of the digital media market is expected to grow to over 400 billion dollars by 2025. We also see a trend towards touchless devices. With hygiene concerns, people tend to use speech recognition devices to take control in their homes, and we also see conferences and businesses adding devices that limit the amount of personal contact with the screen. So instead of screens, voice commands are getting popular. It's something that's been on the market for a while with Siri, Google Assistant, and others, but we saw the trend increase in 2020 because of the hygiene concerns. And of course the event industry has embraced this concept of hybrid, and hybrid is now coming to our reality, not just for events but for other areas, which I'm going to cover in a second. An interesting survey done by Flipkart showed that a lot of event organizers, like Smartcat with LocFromHome, have moved virtual; more than 75% are already adding a virtual experience for their attendees.
So we are now seeing this trend that events that had only been done offline before are now adding a second audience of virtual users, and this trend will continue. Even physical events will still offer a virtual experience to their attendees. And there are many benefits for event organizers: you get more attendees, further reach, and of course inclusivity for people who couldn't travel but can now enjoy the full experience in a virtual setting, using virtual platforms like Remo and many others. It gives attendees and speakers the choice of whether to travel or to access the event from home. And talking about language: interpreters also started to work remotely, which wasn't that popular before 2020. Because of these virtual-experience events, we are now seeing pre-recorded videos being shown. Smartcat, to give you credit, you're doing all your presentations live, which is a thumbs up, so keep that up. But some other events using this hybrid approach show mainly pre-recorded videos. And to talk about speech recognition: live captioning and automated subtitles are now a very hot topic. There are a lot of technology advances happening in this sector, but we're still asking how accurate speech-to-text is. Can we trust the machine to do it? The most recent benchmark of transcription services, which came out in March last year, showed that there is still about a 15 to 20% error rate among popular service providers, so there is huge room for improvement. But this technology gets better and better, and every year they gain one or two percent extra. So when we look at the speech development sector, we saw an increase in acceptance: now, finally, speech-to-text services are actually getting more recognized and widely adopted by other players.
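The "15 to 20% error" figure quoted above is usually reported as word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. A minimal sketch follows; real benchmarks additionally normalize casing and punctuation and break the score down into substitutions, insertions, and deletions.

```python
# Word error rate (WER): the metric behind transcription-accuracy benchmarks.
# It is the Levenshtein (edit) distance between the reference and hypothesis
# word sequences, divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "we got rain as usual in limerick today"
    hyp = "we got a rain as usually in limerick"   # typical ASR-style slips
    print(f"WER: {wer(ref, hyp):.2%}")
```

On the example above, one insertion, one substitution, and one deletion against eight reference words yield a WER of 37.5%, which shows how quickly small slips add up in spontaneous speech.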
There were two interesting resources, one by Markets and Markets and another by Data Bridge Market Research. They showed that the speech technology industry is actually growing at a 20% rate, and in the next couple of years it's expected to hit close to 4 billion US dollars. In terms of media localization with live captioning and transcription, foreign subtitles are something the audience is using. It hadn't been popular before, but now we see movies with foreign subtitles actually getting the top Oscars. For example, Parasite was the first film not originally in English that, with the help of subtitles, won four Academy Awards, including the Best Picture Oscar. So we see this trend of subtitles crossing over, not just from English to other languages but from other languages to English, and getting recognized. Amazon Prime Video researchers just announced a recent technology that would improve the detection of subtitle errors, with up to 89% accuracy; they're actually testing human subtitles and detecting errors. Imagine that in combination with automated transcription, automated translation, and automated quality checks: the system would already be improved to up to 90% guaranteed quality, which is amazing, and we're going to see that improve even further over the years. Dubbing, of course, still remains popular. Some people prefer subtitles; some people prefer to listen to audio in their own language. And again, we see this trend of foreign movies getting recognized and getting into the Netflix Top 10, like the series called Lupin from France; it's actually the first such show that got that high a rank. Netflix invested heavily in original human-powered narration services and dubbing, and they still invest in it. But now we have this new technology that really jumped out at us, which is speech synthesis.
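Since subtitling and captioning are central here, it is worth seeing how simple the delivery format actually is: most of the subtitle pipelines mentioned above ultimately emit plain-text cue files such as SubRip (.srt). A minimal sketch, with made-up example cues, assuming each cue is a (start_seconds, end_seconds, text) tuple:

```python
# Emit a SubRip (.srt) subtitle file: numbered cues, each with a
# "HH:MM:SS,mmm --> HH:MM:SS,mmm" time range and the caption text.

def srt_timestamp(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"   # SRT uses a comma before ms

def to_srt(cues) -> str:
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"

if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, "Hello, Limerick!"),
                  (2.5, 5.0, "We got rain, as usual.")]))
```

An automated captioning pipeline is then just speech-to-text producing timed segments, optionally machine translation of each segment, and a writer like this at the end.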
Essentially, it's computer-generated voice that sounds natural and is almost indistinguishable from a human. There was research done by the University of California, Riverside in 2019 that already showed that some computer-generated voices are indistinguishable to the average human ear. Last year that jumped ahead, and at the forefront we have Nuance, which is already offering firms a way to construct their own enterprise intelligence systems. You can basically put in any text and hear it with all sorts of accents, in any language you want. This hasn't been widely available to us before; now it's coming to the masses, to businesses, and to clients, offering truly multilingual, AI-powered customer service. Other companies, like Voiceitt, built speech recognition algorithms and voice databases by working with people who have atypical speech patterns, so now there is research into speech that hasn't been popular or widely covered, and there is still room for improvement. Developers who want to build an intelligent voice assistant can use Rasa X with some free tools. And of course, this industry requires some sort of standardization, so the Open Voice Network is an emerging initiative that promotes standardization in speech development engines. Another technology that came into our lives and, again, got a boost in 2020 is deepfakes. People have already tried just uploading an image, even a poor-quality one, to a video, and it will show with your face. This technology has become so advanced that it's now possible to apply it to movies. When you think of it this way, deepfakes prompted the next development: synthetic lip sync, where the original character actually moves their lips according to the language. And this leads to the concept of synthetic dubbing: you have the original actor speaking the language with the correct lip sync in the target language, which again hadn't been widely adopted before.
But now we see it already happening. Two interesting research papers talk about this subject. One, by Google and DeepMind, is called "Large-scale multilingual audio visual dubbing," and it actually explains this concept of transcribing the content, then translating it, then dubbing it, and then ensuring the lip sync. Another resource, "Towards Automatic Face-to-Face Translation," presented the LipGAN framework, which shows how the audio spectrogram and video frames come together to produce a truly visual experience where the actor speaks, moving their lips, in the target language. A couple of companies raised significant money last year and this year. Descript is worth mentioning, with their 30 million in fundraising; they introduced the concept of voice cloning, where you could take anyone's voice and just reproduce it completely. Synthesia raised 12 and a half million US dollars; they offer virtual avatars and also lip sync technology, and you can create new videos or use your existing videos. A third company worth mentioning, Papercup, raised 11 million. They don't offer lip sync yet, but they've been able to automatically dub most media channels, and they've signed some good contracts with the BBC, Discovery Channel, and others for AI-powered video localization. There are many other players on the market that you are probably aware of: Zoo Digital raised capital; BTI Studios and Iyuno merged, and Iyuno acquired SDI, raising 160 million; a startup from former BTI Studios founders also raised 1 million; and there were further acquisitions, including one by Sony, with VSI acquiring Proximo. That is a lot of events, acquisitions, technology advances, and investment happening in the media localization space. It's a really hot topic to be in, and we're going to see more technology advances because of what's happened in video production.
People need a system to work in, somewhere to edit the videos, and a company worth mentioning, Minecraft, offers this workflow and cloud-based solution for video production. They've also been featured in a video recently, and for anyone working with that, it's worth checking them out. Live streaming, of course, is booming: live streaming increased by at least 34% just in the last year. A good example worth mentioning is Taobao, one of the Alibaba Group companies: they launched a live streaming project to help farmers survive during the pandemic, and it had quite a good uptake. Now the streaming technologies are adding localization capabilities. If we look at Facebook or Instagram, localization of live streams isn't there yet, but over the next year or two we're going to see popular streaming services add capabilities for other languages. And if you look at the statistics for Twitch, which is one of the most popular streaming services, more than 50% of their streaming is already happening not in English, and we're going to see that trend continue. Localizing streams that aren't in English hasn't been possible before, but I give it one year before streaming services add full support for localization of live streams. Another trend we see emerging now is live and virtual tours. You can enjoy a virtual tour of different countries without having to be there, led by the tour guides. And interestingly enough, guides started doing virtual tours in many languages, so you get this virtual experience of walking around a virtual environment and listening to the tour in your own language.
Events started to offer this virtual navigation environment. The platform we are using today has some sort of virtual capabilities, so we have semi-virtual events, but you still have people appearing on stage with virtual backgrounds behind them. And finally, this trend is still emerging; it's not there yet, but experiencing an event in a fully virtual way will come into our daily lives, probably maybe two or three years from now. Before virtual reality comes to life, we will see the concept of augmented reality adopted. Last year, to give you an example, the Israeli President addressed his people using a projection. It was done by a company called Omnivor, based in Seattle. So now you get this idea of virtual characters, or actually real characters, appearing near you in augmented reality. You probably remember the game called Pokémon Go: Pokémon were virtual characters, but now this is being taken to humans and avatars. An event worth mentioning is the Virtual Beings Summit, where all attendees were actually virtual characters. So we're seeing this concept of virtual hosts, virtual avatars, and virtual characters actually replacing humans. To take another example, Deepak Chopra created his virtual character to promote meditation and relaxation; it talks with his voice, it looks like him, it has the same mannerisms, and we will see more and more of that coming into our lives. An interesting number: by 2020, companies building artificial humans had raised more than 320 million. So this industry, together with multimedia localization, is attracting a lot of investors. Last year at the Consumer Electronics Show, Samsung presented humanoid artificial intelligence assistants, chatbots that, again, look like humans. So you have these characters who could be running events, talking in different languages, and their lips would move exactly
as a human being's would. That movement is coming, essentially, to our lives, and it's happening already as we speak. So what I'd like to leave you with is the concept of hybrid reality. We're talking about localization, we're talking about multimedia, we're talking about video: humans will be able to interact with virtual characters, to interact beyond the two-dimensional screen. Now we have this camera where you see my face, I see yours, and we are talking; in a couple of years, we will be able to interact with each other in 3D space, even if we're not physically present in the same room. Another example I would like to mention is the character in the video currently playing on my screen, called Lil Miquela. It's a completely AI-powered character that writes songs. She already has more than 3 million followers on Instagram and has around seven songs published on YouTube. She has followers, she's virtual, she's drawn, but she is there. And this is the reality we are currently living in. So what I'd like to ask you, and ourselves, is how at risk we are with the news channels, given this concept that you could take the president of any country, put on his face, and make him speak any language. It makes it fun and cool and interesting, and you can make jokes out of it. But it speaks with the same voice, it moves the same way, and we're already at the point where we cannot distinguish what's real and what's not. And this will improve even further. So is it ethical? What risks does it pose? And can we do anything about it? I would like to leave you with this question. I think I went through my slides a little bit quicker than I originally anticipated, but that will give us more time for questions and answers.
And if you need any help with interpretation or voiceover, my team at Translit will be happy to look after you. And I would be interested to continue the conversation with some like-minded peers in the industry.

Max Morkovkin 19:49
Alex, that's definitely a hot topic. Thank you very much for bringing it. One question from my end: how can I be sure that it's you on the other end?

Alex Chernenko 20:03
That's a good question. Well, I could have pinched you, but not remotely.

Max Morkovkin 20:10
Okay, good. I think we already have the first question. And, guys, I encourage you to go to the Q&A session and ask your questions. We will also try to take one question or a couple of questions live: if you raise your hand, my colleagues will add you to the screen. And yeah, the prize is a very nice book, The 7 Habits of Highly Effective People. So let's see what the question is about. It's from Jana: what are the implications of all these technologies for linguists and other people involved in localization?

Alex Chernenko 20:53
When we're talking to interpreters and translators, we get asked this question a lot: will I be out of a job? The same question applied to machine translation before, and translators are still working; the interpreters will have to adapt. There is a famous quote in the interpreter community that says interpreters will not be replaced by technology, but by interpreters using the technology. So the implication is that people will have to adapt to the technology and use it; without it, they will be out of a job. But essentially, we now see the trend of synthetic speech, which means narrators are actually losing some of their jobs. In the dubbing industry, while Netflix is investing in real dubbing, part of that will be replaced by synthesized speech. So of course some jobs will be lost, as with many AI systems, but at the same time this will open up new and additional jobs in other sectors.

Max Morkovkin 21:54
Okay, let's move to the next question, then. It's from David DuraTech. David was a speaker at our previous event. When do we get the first big cinema or streaming-service movie with AI-adjusted lip sync?

Alex Chernenko 22:10
Hey, David. For the first full movie, I'd give an estimate that it could be done by the end of this year. We already see short movies, like 10 to 12 minutes, synthesized, but not a full one-and-a-half or two-hour movie. But if not this year, they will be out next year; I'm quite confident it's going to happen.

Max Morkovkin 22:35
Okay, a question from Julia: indeed, this was quite a popular topic in 2020. Would you agree that in 2021, with machine dubbing, we're still far from having automated voices that sound like real human actors, with high-quality acting?

Alex Chernenko 22:52
That's actually not the case anymore. Speech synthesis has gotten to a good level. The only difference now is that in some languages you can still hear that slightly quick, computerized voice, but for some languages it's already done. If you take Spanish, it's quite good; Russian as well. So some languages have already reached the stage of being indistinguishable by the human ear. And one of the companies that does fully automatic dubbing and synching did it for the full Discovery Channel platform. So we don't yet have full movies, but there are already hour-long TV programs with a fully synthetic voice, and if you listen for long enough you will hear some hints, but they're pretty much unnoticeable.

Max Morkovkin 23:38
Okay, the third question, from Stephen: which of these new technologies will have the most impact on the localization industry in 2022, in your opinion?

Alex Chernenko 23:49
It's a combination. Live captioning and transcription are the weakest part at the moment. Also, once you have the text, you can actually synthesize the voice, and text-to-speech is at a very good level already. So those technology advances have already happened; when speech-to-text improves, that's when we're going to see the change. With a high error rate it's still a weak point. So I believe the biggest advances will be in speech-to-text, because that's the weakest point, and that's the problem the market is trying to solve at the moment.

Max Morkovkin 24:26
And we have a question from V. Coilover: if the present is interaction through voice assistants and speech-to-text, what do you think is the future for us? What is the next step?

Alex Chernenko 24:40
The future is hybrid. People are longing for physical communication, and while we are thankful to guys like Smartcat for organizing virtual events, I can't wait for in-person conferences to happen. And I believe the future is actually a choice: whether you'd like to experience the event virtually or physically, or have the option to switch. That's the hybrid concept I talked about: event organizers are adding both a physical experience and a virtual one. So we are lucky that we now get to choose whether we want to travel or experience the virtual concept or not. The next stage would be: how far do we want to go? Is it just watching a screen, or actually putting on the glasses and walking around the conference as if you were there, talking to people who have a real face but are put into a 3D shape? I would like to try that. I've only tried it at home, but actually talking to another person where both of us appear virtually in 3D space would be just mind-blowing, in my opinion.

Max Morkovkin 25:43
I hope we will be able to use some of these technologies for our LocFromHome conference. Okay. And a question from Keith: I think this is what everyone in our trade has on their mind these days. Arle Lommel said that machine translation will only replace those humans who translate like machines. Do you think it's true?

Alex Chernenko 26:05
It's true. The concept of transcreation has always been there: sometimes when you localize, you don't just translate. Straight translation works in a legal setting, where you have official documents, but when we're talking about localizing, there is always an element of creativity, and that's the part the machine doesn't get. It learns, it improves all of that, but the creative element is still what distinguishes a human from a machine. And the same thing with movies: you can't just translate a movie automatically and keep the meaning. It works for news, it works for official content, but with movies you need to be creative. And that, I think, keeps it safe from our point of view: humans still need to work on it to make it truly localized, with the creative element.

Max Morkovkin 26:51
The questions keep coming in, Alex. Just a reminder that you will need to pick the best question, so try to memorize all of them. A question from Thomas: what advice do you have for small and medium-sized LSPs who would like to offer these services but have smaller R&D budgets than the larger competitors?

Alex Chernenko 27:14
That's a good question, and that's what we at Translit are also exploring; we don't have the budget to go for the large enterprise solutions. But Rasa X is one of the companies I mentioned in my slides, and they offer free tools to try it out. Nuance also offers some trials of their systems. So there are tools and free resources available that you can start and play with before you go into fully acquiring a system. Or you could actually play with their platforms and develop on them yourself using those free tools. So start with the free resources and the technology if you can. Another thing you could do is simply resell somebody else's services; some of the large enterprises allow LSPs to resell their services. That's another way to get hands-on with the technology.

Max Morkovkin 28:05
I'm not sure if you heard, but we at Smartcat are also aware of this trend, and one of our recent releases was dedicated to subtitles: you can preview the video inside the editor pane and translate the subtitles in Smartcat. So this can be an option for someone as well. Okay, let's take another question; that's a really hot topic: how will cognitive services, such as Azure, AWS, etc., help drive these new technologies?

Alex Chernenko 28:38
I'm not sure if I understood the question.

Max Morkovkin 28:42
Rodrigo, please expand on it; the question is too short.

Alex Chernenko 28:48
Well, the word cognitive has a couple of meanings, like cognitive load, so I'm not sure I can answer, because I didn't fully understand it.

Max Morkovkin 28:57
Yeah, no problem, Rodrigo. Maybe you want to ask it live: raise your hand and we will be glad to listen to you. We still have five minutes for questions. So, okay: what about AI solutions for rare languages that may not have enough language data for sufficient AI training? That's a challenge.

Alex Chernenko 29:19
When you think of it this way, the systems that I presented today are sitting on top of natural language processing. The streaming and multimedia localization are built on top of Google Translate and similar engines, the same as all the other translation systems. So that is more of an underlying problem for those rare languages. And as you know, the languages keep getting better. Actually, one of the things I was reading just yesterday is that machine translation from English into Russian is now rated better than human translation; that's official, at least according to a number of benchmarks. So the rarer languages are added at a slower pace, and we are not going to see AI-powered dubbing or AI-powered subtitling for those languages until they get to an acceptable stage. But it's a matter of time.

Max Morkovkin 30:17
There is a question from Polina: I would love to know what Alex thinks are the ethical implications of such tools and solutions for the companies that work with them, not the users.

Alex Chernenko 30:30
This is the question I raised at the end. And to be honest with you, from my research, there are already a couple of startups that are actually combating the use of deepfakes. One of them actually came from Ireland; there was another that came from another country, I don't remember which, but these two startups already detect deepfakes, basically to counter the ethical side of things. There are rumors that some of the presidents you've seen on screen could be deepfakes. I am not into conspiracy theories, but there are rumors that the Queen of the UK has appeared on screen as a deepfake video. Whether it's true or not, those systems are already detecting the use of deepfakes, and the fakes are getting so good that they're learning how to overcome detection. So the ethical question is that at some point we cannot say whether what we are seeing in the news has actually been said by that person, and that person may not even be aware that they've appeared on that screen. That's very unethical, and regulation needs to happen around media and news. In a creative environment, with movies, it's acceptable, but when it goes to media, news, or legal content, then regulation needs to happen, and at the moment there is no regulation. So it's very controversial.

Max Morkovkin 31:59
Rodrigo sent us an additional message: sorry, I meant speech-to-text automation and text-to-speech, using cognitive services in particular.

Alex Chernenko 32:11
Text-to-speech, yeah. Text-to-speech is already improving, and there is this concept of voice cloning. The technology is so advanced that it can add emotions, it can add feelings; it's not just reading the text in a plain way. You can clone someone's voice and completely reproduce it, and you can add anger, you can add happiness, you can add an exclamation where you want. So the systems already get the element of emotion, the element of tonality, and all the subsections of voice and audio that a human voice has, making it indistinguishable to an average human being for some languages already.
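One concrete way TTS engines expose the "emotion" and tonality knobs described here is SSML, the W3C Speech Synthesis Markup Language: tags for prosody (pitch, rate, volume), emphasis, and pauses that wrap the text to be spoken. Below is a minimal sketch that builds an SSML string in Python; the `ssml` helper and its parameters are illustrative, and which tags (and values) an engine actually honors varies by vendor.

```python
# Build a minimal SSML document controlling rate, pitch, and pauses.
# SSML is the W3C markup that many text-to-speech services accept to
# shape how synthesized speech sounds beyond a flat reading.
from xml.sax.saxutils import escape

def ssml(text: str, rate: str = "medium", pitch: str = "medium",
         pause_ms: int = 0) -> str:
    body = escape(text)                      # escape &, <, > for valid XML
    if pause_ms:
        body += f'<break time="{pause_ms}ms"/>'   # trailing pause
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{body}</prosody></speak>')

if __name__ == "__main__":
    # A faster, higher-pitched delivery with a short pause afterwards,
    # approximating an excited tone.
    print(ssml("I can't believe it!", rate="fast", pitch="high", pause_ms=300))
```

The resulting string is what gets submitted to a synthesis endpoint; voice cloning itself is a model-level capability, but this markup layer is how callers request anger-like emphasis, happier pacing, or deliberate pauses on top of it.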

Max Morkovkin 33:01
Alex, so far we've had a number of questions recorded for the presenter. So let's choose the best question and make the giveaway.

Alex Chernenko 33:14
I think it's the question that came in just before, asked by a lady, Polina. Yeah, the ethical implications, because that's the question that I want everybody to leave with, and it's the right question to be asking about these technologies. It's an area of great concern; that's the main concern we have with all this. The technology is always great as long as it does good for humanity; if the technology is used to manipulate somebody, to create fake news or other things, that's when regulation needs to happen. So Polina, you're getting the book for posing this question.

Max Morkovkin 33:54
Congrats, Polina! I hope you will enjoy it, and maybe it will be one of the topics that you will present on at some of our conferences.

Alex Chernenko 34:03
Thank you very much for having me, guys. I hope it was interesting and full of information. I ran through it very quickly, but we actually cut 20 or 30 slides from the presentation to get it down to 20 minutes. If anybody wants additional information, feel free to reach out to me, and I can send you a copy of the slides on LinkedIn or by email.

Max Morkovkin 34:29
It was a really interesting and hot topic. Thank you very much, Alex. Looking forward to the rest of the conference. Thank you.
