
The language market has hit peak human translation – now what?

February 7, 5:39 PM

Human translation has hit its peak. Although interest and demand remain high, there are not enough linguists or time to meet the ever-growing demand for more content in more languages. In this presentation, Dr. DePalma describes the challenge of never-ending global content volumes, quantifies the human shortfall, and explores both the expected solutions and several novel human augmentations.

Transcription

Bryan Montpetit 00:06 Next, we have Don DePalma. As you know, if you've had a chance to see Don DePalma before, his presentations are always what I call brain benders. There's so much information, so much quality information, that my brain typically hurts afterwards, but that could just be me. I always enjoy listening to him speak as well. Don's presentation is called "The language market has hit peak human translation. Now what?" I'm sure it's going to be a fantastic, in-depth presentation within the time span that we have, of course. So Don, I'm not sure if you can hear me. Welcome, I'm happy to have you back. (Glad to be here.) Let me get the screen share over to you first. And there we are, we had a little bit of inception.

Don DePalma 01:03 Okay. So I think we have a screen here.

Bryan Montpetit 01:07 We do indeed, absolutely. We can see the little Zoom menu at the top; I don't know if that's intentional or not, in case it hides information... there we go. All right, I will leave you to it, and I'll pop back in when there's about five minutes left.

Don DePalma 01:22 Okay, great. Well, thank you, and hello, everyone, from snowy and sunny Massachusetts. I'm Don DePalma, the founder and Chief Research Officer at CSA Research. As Bryan said, the title of my presentation is "The language market has hit peak human translation. Now what?" What that does is state a problem: there's just too much stuff to translate. I propose a solution in the form of several technologies that underpin the conference's theme of "stronger by sharing." These technologies support what we at CSA Research have been calling augmented translation: the close collaboration of human translators with various machine services to make them more effective and more efficient. What I'll present today are several technologies we've been following for a long time. At the back of the presentation there's a bibliography of our research on these topics, and a link to a blog that should be published by now summarizing today's presentation.

So in our discussion today, I'll outline why these technologies are needed: there's just too much content to process. In the course of that, I'll take you through a thought experiment about why human translators need machines to meet the demands of a multilingual world. Then I'll discuss some technologies behind augmented translation and how they combine to empower and, importantly, uplift linguists, while simultaneously making translation more attractive from a practice and logistical perspective.

So, on to the topic: the deluge of content that the language sector has to deal with. Every couple of years, IDC updates what it calls its Global DataSphere, a sizing of the market for computer storage systems and what those systems in fact store. In its most recent study, IDC estimated that the world would add 59 zettabytes of data to disk and solid-state storage in 2020. As noted, a zettabyte is a trillion gigabytes. Now, this is all kinds of data. There are content types such as numeric, imagery, and text; there are untold combinations of file formats; and there's data for every kind of knowledge domain, in thousands of languages. You get the picture: there's a lot of stuff out there. I should note that much of this growth is in sensor data, telemetry, and other machine-generated content. For example, IDC said that real-time data like these, from all the devices around us, are going to grow 30% by 2025, driven by 150 billion devices.
That's machine-to-machine content, plus the ballooning 4,900 digital interactions that 6 billion consumers have every day. In terms of management, there's a lot of structured data out there, and it knows its place. But the problem, and what the language industry deals with, is all of the unstructured data. Let's consider the data that's more open to interpretation than databases, sensor data, and telemetry: the content that makes up websites, technical publications, marketing collateral, and support material. That productivity and embedded content data is growing as well, slated to increase at about a 40% compound annual growth rate (CAGR) over the coming years. The reality here is that very little of that content is ever touched for any purpose other than reuse. In fact, much of it never leaves the format or the silo that created it. According to IDC, just 11% of that 59 zettabytes of data this year will actually be new; the other 89% is consumed or reused in some manner. There are enormous amounts of copying, plagiarism, repurposing, translation, and what we like to think of as any kind of transmogrification any time any little bit of it is leveraged. And that brings us to my mascot of the content deluge: the Ouroboros of the ancient Egyptians, the snake that constantly renews itself by feeding on itself. That's the roughly 90% of reuse every year.

Now, how much of that content is actually translated? Well, it's time for a small exercise in data analysis. Here are some data points that I've extracted from the DataSphere study and from CSA Research's sizing of the translation market. First off, there are 10.9 × 10^16 bytes of content generated every day. I've already removed all the data that's unlikely to be translated; that's what brings it down from the 59 zettabytes. If you're trying to contrast these numbers, it's 10.9 with 16 zeros following, rather than 59 with 21 zeros following. Now, the written professional translation market is $33 billion per year. To get to the daily translation output, simply divide by 365; we assume that translation is going on every day, so it's just stuff that accumulates over time. Then we divide the result by 4.3 million. I've already done some calculations in advance to derive that number and save you some time. Now, for the daily amount of bytes generated every day, let's calculate the percentage that is professionally translated by the industry. So, does everyone have their calculators ready? Okay, I'm going to start the timer... Pencils down. Let's see how you did. There's the answer: the global translation industry, all of the LSPs out there (and we're not counting stuff done inside businesses), currently translates approximately 800-millionths of 1% of all content generated globally every day. That's seven zeros to the right of the decimal point. So you can see that it's an infinitesimally small piece of the big daily potential for translated content. By the way, email me if you got the right answer: don@csa-research.com. For those of you who didn't get it right, you can check your math against the slide once the presentations are ready for downloading.

Now let's look at what that means in terms of translator capacity. We'll take the next step and use a term from economics: peak translation. That's when a market's supply or demand has reached its maximum. What happens after that?
Well, an average translator currently outputs about 2,250 words per day. Converted into bytes, this yields 43,650 bytes a day of translation volume. To produce the industry's current output would require 166,000 translators. Let's assume for the sake of argument that just one ten-millionth of the content identified in the previous slide is interesting and worth translating. How many translators would it take to translate that much content into just one language? Well, we find it would take 25 million translators, assuming they did nothing but translate. That's the combined population of Romania and Denmark. And remember that many languages, including official EU languages, have only a fraction of this number of speakers, so you're not going to be able to find those translators. Now, suppose we want the content in the 135 languages that CSA Research tracks as having online economic significance. The numbers are staggering: 3.4 billion linguists, again doing nothing but translating. That's the combined population of China, India, the European Union, and the United States. This would require a major economic stimulus program to make it happen; let's call it the Full Employment Act for Linguists. By the way, I refer to this image here as "The Rime of the Ancient Meme." So basically, we're gonna need a bigger boat: humans can't handle the translation volume alone.
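To make the capacity arithmetic easy to replay, here is a minimal Python sketch. The constants are the figures quoted in the talk; the content fraction is a free parameter, and the value used below is simply the one that reproduces the talk's headline figures of roughly 25 million and 3.4 billion translators.

```python
# Back-of-envelope sketch of the translator-capacity thought experiment.
# Constants are the figures quoted in the presentation; illustrative only.

DAILY_CONTENT_BYTES = 10.9e16        # translatable content generated per day
BYTES_PER_TRANSLATOR_DAY = 43_650    # ~2,250 words/day converted to bytes
TRACKED_LANGUAGES = 135              # languages CSA tracks for online economic significance

def translators_needed(content_fraction: float, languages: int = 1) -> float:
    """Full-time translators needed to cover a fraction of daily content
    in the given number of target languages."""
    daily_workload = DAILY_CONTENT_BYTES * content_fraction * languages
    return daily_workload / BYTES_PER_TRANSLATOR_DAY

# A fraction of 1e-5 reproduces the talk's headline numbers:
print(f"{translators_needed(1e-5):,.0f}")                     # ~25,000,000
print(f"{translators_needed(1e-5, TRACKED_LANGUAGES):,.0f}")  # ~3,400,000,000
```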
Now let's look at the changing map of language services supply. We published this forecast at the end of 2020 to visualize how the delivery of written language services will change over the coming decade. It shows general growth in all sorts of language services up to 2020, with the bulk of professional services delivered as traditional written human language services. That was the peak of human translation. But the content and translation volumes I described a few minutes ago are changing the equation. Starting this year, human services as a percentage of the overall market will decline. You can expect the use of some CAT tools and workflow to assist, but really nothing that sophisticated. We forecast that the growth in post-editing will taper off and begin to decline in the future. There are some reasons for that; we can talk about them later, but there isn't time in this presentation. Increasingly, we expect customers (and we're seeing this already, and have for years) to expect MT and other intelligent services to be part of the mix. Human expertise, knowledge, and timely involvement in translation will augment the machine as it evolves, so it's important that services are human- as well as AI-delivered.

Now, rather than dive into discussions of whether MT and AI are good or bad for humankind, for now I'll stipulate to any and all arguments about machine translation and AI-driven systems. That is: MT is great, but MT is also the devil's spawn. AI is wonderful, but artificial intelligence is naturally stupid. The bottom line is that the language industry is a big tent. There are huge volumes of content, enormous variety, and everything that goes on, so you can take any point of view on any topic, any technology, or any service, and switch mid-argument to argue the other side.

So let's move on to the main part of the presentation: augmented translation. What's happening? How will that top wave of translation production bring humans, AI, machine translation, all of those things, together? Well, a few years ago at CSA Research we saw several language technologies coming into their own; some are long-time translation software categories that are really undergoing a revolution. We based the term augmented translation on the concept of augmented reality, which refers to modes of human interaction with the real world mediated by computers. In this image, you see a customer interacting with a technology-enabled linguist who is in turn supported by a variety of technologies. We'll go through each of these. I have two cautions at the outset. First, these will be just brief overviews; there's a lot more we could say about these solutions. Second, we've been covering these technologies for years, but unfortunately some of them have yet to see mainstream adoption, so there are market developments that still have to happen, even around some of the relatively mature technologies in this grouping.

Onward to the technologies. First, let's keep in mind the first human in the language services equation: the buyer, the consumer, the person who's ultimately digesting all of this content being translated into multiple languages. We also have, at the middle here, the linguist. Remember, the goal here is sharing, making everybody more powerful: optimizing the knowledge and expertise of these highly skilled but limited professional translators.

Okay, first, what we've got here is the accelerator for augmented translation: adaptive neural MT. What that means is that, in near real time, linguists can make a correction and have it incorporated into the engine and all of its suggestions, so that the machine learning happens immediately, or almost immediately. If a translator finds that MT has incorrectly translated something, they have the opportunity to fix it right there in the process: not post-editing, but editing as the translation is being developed. Once corrected, that same correction will appear for that translator and for others working in the same and similar texts. Now, this capability works only with pervasive use of adaptive MT, used across an organization so that everyone working on the project benefits from the near-real-time corrections made by them and their colleagues. We add to this quality estimation to provide an independent guide to the suitability of the translation. Independent means you get an assessment of quality from a separate utility. You also get input from professionally skilled translators with domain expertise.

Looking more closely at the next element: translation memory. That's a long-time but still evolving technology, and we see some significant changes in newer versions. First off, TM and MT share data. Although their functions are different, they share the fundamental goal of reusing what's already been translated, and looking forward, the boundary between the two is becoming increasingly thin. Newer TM also goes beyond traditional matching to include sub-segment matching; that's becoming more like MT. In addition, some systems use what is called shallow parsing, or intelligent matching, to correct for things like changes of dates or numbers. Those would normally be fuzzy matches; this can turn them into perfect matches.
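To make the shallow-parsing idea concrete, here is a minimal, hypothetical sketch of "placeable repair": if a new segment differs from a stored TM segment only in numbers or dates, the stored translation is patched so the fuzzy match can be treated as an exact one. Real TM engines use far richer parsing; this toy version also assumes placeables appear in the same order in source and target.

```python
import re

# Matches simple numeric placeables: integers, decimals, dashed dates.
PLACEABLE = re.compile(r"\d+(?:[./-]\d+)*")

def repair_match(tm_source: str, tm_target: str, new_source: str) -> str | None:
    """Patch tm_target if new_source differs from tm_source only in
    placeables; return None when the match is genuinely fuzzy."""
    if PLACEABLE.sub("#", tm_source) != PLACEABLE.sub("#", new_source):
        return None                            # a real textual change
    new_values = PLACEABLE.findall(new_source)
    if len(PLACEABLE.findall(tm_target)) != len(new_values):
        return None                            # placeable counts disagree
    values = iter(new_values)
    return PLACEABLE.sub(lambda _: next(values), tm_target)

# A 99%-style fuzzy match differing only in a version number and a date:
print(repair_match(
    "Release 2.1 ships on 2020-05-01.",
    "La version 2.1 sera livrée le 2020-05-01.",
    "Release 2.2 ships on 2021-06-15.",
))  # -> "La version 2.2 sera livrée le 2021-06-15."
```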
Translation memory is a mature technology, but we're seeing language technology developers use AI to handle tasks such as cleaning up translation memories, identifying segments that are likely to cause problems, and finding the best contextual match when multiple ones exist in a TM. And again, that's backstopped by the linguist doing pre-editing rather than post-editing.

Next up, we have a new generation of terminology management that differs from traditional glossary-based word lists and concept-oriented termbases in several fundamental ways. First off, developers are moving from a top-down approach to emphasize documenting terminology as it's actually used. For example, let's say Samsung releases a tablet called the NotePad Pro. It may find that in its support organization, customers are calling in and asking about problems with "the iPad," the Kleenex of tablets. A help system not trained to read "iPad" as "NotePad Pro" is going to increase costs for Samsung and harm the brand. Newer systems actively monitor online discourse to understand how people are talking about products and how they're using language in the wild. Such discovery can link how the world discusses an organization with its internal corporate view, resulting in a more responsive approach to all kinds of customer communication. And finally, importantly, these systems are moving beyond terms as single words or short phrases to include micro-content such as brands, slogans, legal disclaimers, and images. The advantage of this approach over standard translation memory and terminology management for such content is that creators can assign rich and robust metadata to these terminological structures, so they can document which product lines (the iPad, the NotePad Pro, whatever) or services contain which versions of legal notifications or other bits of documentation.

Let's move away from linguistic functions. There's also lights-out project management. That means a lot of tedious tasks can be automated. It can, for example, automate the routing of jobs to the best linguist and monitor their progress. Rather than requiring a project manager to find the best person by going through previous jobs, résumés, or personal preferences, these systems can proactively assign jobs based on skill sets and past performance. They factor in things like deadlines and workloads from other projects that a particular linguist might be working on. They can find translators and kick off jobs without a human project manager even touching the system. This obviously reduces the number of small touch points, such as confirming receipt of a job, which in our research on muda (waste) often add up to big time sinks. Our research shows that it's not unusual for such tasks to consume half of the allotted time for a project before the linguist even gets the job to work on.
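As a minimal sketch of what such lights-out routing might look like, consider the toy scorer below. The fields, weights, and capacity model are invented for illustration and are not drawn from any real TMS.

```python
from dataclasses import dataclass

# A toy "lights-out" router: score each linguist on language fit,
# domain fit, past quality, and current workload, then pick the best.

@dataclass
class Linguist:
    name: str
    languages: set[str]
    domains: set[str]
    quality: float           # rolling average of past review scores, 0..1
    words_in_flight: int     # workload from other active projects

def route_job(target_lang: str, domain: str, pool: list[Linguist]) -> Linguist | None:
    def score(t: Linguist) -> float:
        if target_lang not in t.languages:
            return float("-inf")                    # hard requirement
        domain_fit = 1.0 if domain in t.domains else 0.3
        load_penalty = t.words_in_flight / 50_000   # crude capacity model
        return 2.0 * t.quality + domain_fit - load_penalty
    best = max(pool, key=score, default=None)
    return best if best is not None and score(best) > float("-inf") else None

# Example: route a German life-sciences job.
pool = [
    Linguist("Ana", {"de", "fr"}, {"life sciences"}, 0.92, 12_000),
    Linguist("Ben", {"de"}, {"finance"}, 0.88, 2_000),
]
print(route_job("de", "life sciences", pool).name)  # -> Ana
```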
These systems obviously rely on machine learning, observing language and content to make predictions. Developers are now working through an unintended or unexpected consequence here: how can they balance the needs of systems that have to gather more data about how people are performing against the individual's right to privacy in how they work? This is a conundrum that's still being worked out, so in the greater scheme of things, this capability might have some limitations moving forward.

Then we have the concept of automated content enrichment, or ACE. This is a very useful technology that we've been researching for a few years now, but it's still scarcely known. It parses content to find entities, such as words, concepts, names, and dates, that have defined meaning and can be linked to authoritative resources with relevant metadata. ACE then embeds links to online information about those entities, including recommended translations, definitions, organization-specific data, locale-specific details, and other contextual information that translators would otherwise have to look up before starting the job. Instead, it's right in front of them when they start. Linguists tasked with translating ACE-enhanced intelligent content thus begin the job with the basic research already sitting in the source file. That's particularly important when there's terminology that's specific to a business, a product, or a particular domain, so the piece can be done right. Unfortunately, although these intelligent tagging systems have existed since the mid-2000s, some 15 years or so, adoption has been slow. Right here, there's a link to a blog I wrote a few years ago about something that came out of a European Union framework project.
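As a rough illustration of the enrichment idea, here is a toy annotator. The entity catalog, fields, and URL below are invented for this sketch; a real ACE system links against authoritative terminology databases and knowledge resources.

```python
# A toy content enricher: scan text for known entities and attach the
# metadata a translator would otherwise have to look up.

CATALOG = {
    "NotePad Pro": {
        "type": "product",
        "recommended_translation": "do not translate (brand name)",
        "reference": "https://example.com/termbase/notepad-pro",  # hypothetical
    },
}

def enrich(text: str) -> list[dict]:
    """Return annotations for every catalog entity found in the text."""
    annotations = []
    for term, metadata in CATALOG.items():
        start = text.find(term)
        if start != -1:
            annotations.append({"term": term, "offset": start, **metadata})
    return annotations

print(enrich("Customers report sync issues with the NotePad Pro."))
```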
Finally, there's the platform that ties all these technologies together, and in so doing becomes the operating system for a translation company or a localization department. Everything runs through it. All of the pieces share data with each other, and with the linguist, via the TMS. It tracks activities to make sure resources are where they need to be when they're supposed to be there, and that projects are moving as planned. TMSes also handle financial and billing tasks that eat up a lot of the time managers spend, freeing them up to address things like strategy. Importantly, TMSes are no longer the monolithic beasts they used to be, requiring months or years to integrate and teams of specialists to run. The newer systems are much lighter and nimbler; the TMS itself is becoming a bundle of small microservices, and individual applications can be connected in a variety of ways. We don't have time to go into it today, but this is basically a major rethinking of how technology interacts with language products. The TMS here is seen as the choreographer of everything that goes on, which is why we illustrate it behind everything else.

That ends the lightning round of the seven technologies. Let's look at what this means to the linguist. We really have to understand that the linguist is at the core of it; the goal is to optimize the knowledge and expertise of these highly skilled professionals. These technologies all augment the linguist's capabilities in several ways, turning the several hundred thousand translators out there into a much bigger cadre of linguists. Instead of post-editing ex post facto, they're in the thick of the workflow. From a content intelligence perspective, the content carries a greater amount of the semantic load, so translators can skip a lot of lookups and other details when the content is pre-populated with links and pointers. From a project management perspective, the tasks that can be learned by a machine can be done without thinking, or offloaded, allowing linguists to spend more time on what they really enjoy doing. And then there's the TMS that orchestrates everything: data flow, integration, and everything else.

So let's summarize today's session and draw some conclusions. If you take away anything from today's session, just remember these. First off, combination: the future is not pure MT or pure human translation. That's pretty obvious, but what we often haven't looked at is the intelligent combination of the two. With augmented translation, the distinction between human translation, machine translation, and post-editing is considerably blurred in favor of a future that uses machines and humans in varying mixtures to meet specific needs. Augmentation: future success in translation requires more than just MT. The technologies here demonstrate how machines augment the very necessary capabilities and knowledge bases of humans. Elevation: this augmentation will not replace linguists, but will instead make them more valuable by letting them focus on interesting and rewarding tasks. And finally, transformation: we're at a moment of fundamental change in the language technology sector. It promises to deliver significant benefits to organizations of all types and linguists alike, but they have to be willing to change, adjust, and invest to see these improvements. And I'll return to another old meme here: the Six Million Dollar Man, enhanced by nuclear-powered bionic body parts. Augmented translation has a similar potential to supercharge today's already highly skilled translators. We can rebuild them. We have the technology. We can make them better than they were: better, stronger, faster.

So, the last slide, really. Here you can see some of the research reports CSA Research has produced around the topics I discussed today; it lists the year in which we started each of these streams of research. We also posted a blog earlier today summarizing the major themes of this presentation, giving a lot more links and providing some more information. That's the highlighted text at the bottom of the slide. And with that, I conclude my presentation and invite any questions.

Bryan Montpetit 23:42 All right, thank you, Don. I appreciate the presentation. As always, it's a wealth of information, and you just keep pounding the data. I love it. We did have some questions come up, so I'll go through them. If you can, please do your best to provide short answers; I just want to make sure we can fit everything in. The first one says: "Hi Don, your blurb reads: although interest and demand remain high, there are not enough linguists or time to meet the ever-growing demand for more content in more languages. If it's such a supply-and-demand issue, expert translators' remuneration must be skyrocketing. Have you observed this in the translation market?"

Don DePalma 24:19 It's a great question. Yeah, there's never enough of the translators you need in the vertical sectors that demand more than just generalist knowledge. When you're in areas like life sciences or financial services, or any kind of heavy-duty physics, you need somebody who can handle the material. One of the jokes we put into the piece we published today is that some translators are just uncomfortable dealing with anything that has the word "quantum" in it, or "Bitcoin," or "derivatives," or "Riemannian manifold." Any time any of those words show up, you can be sure there's a search for a translator. So that causes delays in some projects, especially as they move into a more agile development mode as well.
So yeah, the need for people who know what they're doing in a particular domain is growing, and it's much greater than it's ever been.

Bryan Montpetit 25:22 Great, thank you. Oh, Don, would you mind stopping your screen share? That way the view will center on us. Not that people want to see me in particular, but... thank you for that. You can keep your camera on if you'd like. We also had another question that said: "The size of the written translation market for one year, $33 billion, does not correspond with what Kirti Vashee has written. Where does your figure come from?"

Don DePalma 25:50 Well, what we do every year, and have been doing since 2005, in a very systematic way since 2010, is run the annual global market survey, or study, which actually is open right now to any LSPs; contact me and I can give you the link so you can participate. Every year we get about 500-plus LSPs to give us their numbers, so we get revenue and a variety of details from them, and we have an algorithm that we've been using since 2010 to size the market. The $33 billion is just the translation part of a $49.6 billion language services market. Kirti's number may include that other $16 billion, which covers things such as interpreting, localization, a variety of other language-related services, and a small amount, a couple of billion dollars' worth, of technology. So if his number is around $49.6 billion, he's probably incorporating more. There are other numbers floating around, too. A few years ago at a conference, somebody said the market was actually about a trillion dollars for language services. I think, ultimately, if you include every translation that anyone does, including what's happening at Google and Yandex and Alibaba and wherever else, and put a dollar value on it, we're talking about enormous amounts of money. Translation and interpreting are two of the oldest professions.

Bryan Montpetit 27:27 Fantastic, thanks for the insight. And I actually haven't seen Kirti's number, so I don't know where it came from, but...

Don DePalma 27:34 Is that Kirti asking? Hi, Kirti.

Bryan Montpetit 27:37 I don't believe it was. We also have another question that came in: "Adaptive MT seems to be an improvement over MT and post-editing. However, it doesn't seem to be widely adopted today. Any reason or insight as to why?"

Don DePalma 27:54 There are only a few solutions that have been out for a few years, so it's basically a question of new development happening, and we expect that to increase dramatically over the coming years. I remember, probably about eight or nine years ago, I was keynoting an MT conference in Trento, Italy, and there was a poster project by a whole bunch of MT graduate students showing essentially how long it took to retrain a statistical machine translation engine. About eight years ago, we were talking about weeks or months to retrain MT, with neural MT and statistical for that matter. Over the last few years, we've seen dramatic changes in the speed of retraining these engines. So it's happening: as the newer engines and cores come out, you can expect them to be able to retrain quickly and thus support adaptive MT in the new versions, rather than putting corrections into a queue to be retrained overnight or over the next week.

Bryan Montpetit 29:03 Great, thanks for the answer. I appreciate it.
And thank you for participating in the presentation today. I know we're at time, so I'm going to give people enough time to run off and grab a drink of water or coffee or tea, and we'll come back in about 10 minutes or so, on the hour. Don, thanks again. I really appreciate you helping us out, giving the presentation, and, once again, all the wonderful data. As I said before, it's a brain bender for me, basically because of the speed at which my brain has to try and retain everything. And I look forward to when I can download the presentation afterwards. So thanks for that.

Don DePalma 29:38 Thanks, everyone. Great to see you.

Bryan Montpetit 29:41 Take care of yourself.
