Google Teaches AI To Play The Game Of Chip Design – The Next Platform
Posted: February 22, 2020 at 8:44 pm
As if it weren't bad enough that the Moore's Law improvements in the density and cost of transistors are slowing, the cost of designing chips and of the factories that are used to etch them is also on the rise. Any savings on any of these fronts will be most welcome to keep IT innovation leaping ahead.
One of the promising frontiers of research right now in chip design is using machine learning techniques to actually help with some of the tasks in the design process. We will be discussing this at our upcoming The Next AI Platform event in San Jose on March 10 with Elias Fallon, engineering director at Cadence Design Systems. (You can see the full agenda and register to attend at this link; we hope to see you there.) The use of machine learning in chip design was also one of the topics that Jeff Dean, a senior fellow in the Research Group at Google who has helped invent many of the hyperscaler's key technologies, talked about in his keynote address at this week's 2020 International Solid State Circuits Conference in San Francisco.
Google, as it turns out, has more than a passing interest in compute engines, being one of the largest consumers of CPUs and GPUs in the world and also the designer of TPUs spanning from the edge to the datacenter for doing both machine learning inference and training. So this is not just an academic exercise for the search engine giant and public cloud contender, particularly if it intends to keep advancing its TPU roadmap and if it decides, like rival Amazon Web Services, to start designing its own custom Arm server chips or custom Arm chips for its phones and other consumer devices.
With a certain amount of serendipity, some of the work that Google has been doing to run machine learning models across large numbers of different types of compute engines is feeding back into the work that it is doing to automate some of the placement and routing of IP blocks on an ASIC. (It is wonderful when an idea is fractal like that. . . .)
The pod of TPUv3 systems that Google showed off back in May 2018 can mesh together 1,024 of the tensor processors (which had twice as many cores and about a 15 percent clock speed boost over their predecessors, as far as we can tell) to deliver 106 petaflops of aggregate 16-bit half precision multiplication performance (with 32-bit accumulation) using Google's own and very clever bfloat16 data format. Those TPUv3 chips are all cross-coupled using a 32×32 toroidal mesh so they can share data, and each TPUv3 core has its own bank of HBM2 memory. This TPUv3 pod is a huge aggregation of compute, which can do either machine learning training or inference, but it is not necessarily as large as Google needs to build. (We will be talking about Dean's comments on the future of AI hardware and models in a separate story.)
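Bfloat16 is simple enough to emulate in a few lines: it is just an IEEE float32 with the low 16 mantissa bits dropped, keeping the full 8-bit exponent (and thus float32's dynamic range) while giving up precision. Here is a minimal NumPy sketch of the idea, using plain truncation rather than the round-to-nearest-even that real hardware typically applies:

```python
import numpy as np

def emulate_bfloat16(x):
    # bfloat16 keeps float32's sign bit and full 8-bit exponent, but only
    # the top 7 mantissa bits, i.e. the high 16 bits of the float32 word.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)  # zero the low mantissa bits

x = np.float32(3.14159265)
print(emulate_bfloat16(x))  # ~3.140625: same range as float32, less precision
```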
Suffice it to say, Google is hedging with hybrid architectures that mix CPUs and GPUs and perhaps someday other accelerators for reinforcement learning workloads, and hence the research that Dean and his peers at Google have been involved in is also being brought to bear on ASIC design.
"One of the trends is that models are getting bigger," explains Dean. "So the entire model doesn't necessarily fit on a single chip. If you have essentially large models, then model parallelism, dividing the model up across multiple chips, is important, and getting good performance by giving it a bunch of compute devices is non-trivial and it is not obvious how to do that effectively."
It is not as simple as taking the Message Passing Interface (MPI) that is used to dispatch work on massively parallel supercomputers and hacking it onto a machine learning framework like TensorFlow because of the heterogeneous nature of AI iron. But that might have been an interesting way to spread machine learning training workloads over a lot of compute elements, and some have done this. Google, like other hyperscalers, tends to build its own frameworks and protocols and datastores, informed by other technologies, of course.
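To make the device placement problem concrete, here is a minimal TensorFlow sketch of manual model parallelism: each stage of a toy two-layer network is pinned to a different device, and the framework ships the activations across the boundary. The device strings and layer sizes are placeholders for whatever hardware is actually present; choosing such assignments well for graphs with tens of thousands of operations is exactly what the RL work described below tries to automate:

```python
import tensorflow as tf

# Stage 1 lives on one device, stage 2 on another; TensorFlow copies the
# activations across the boundary. With soft device placement (the TF2
# default), this falls back to the CPU if the GPUs are not present.
with tf.device("/GPU:0"):
    w1 = tf.Variable(tf.random.normal([784, 4096]))
with tf.device("/GPU:1"):
    w2 = tf.Variable(tf.random.normal([4096, 10]))

@tf.function
def forward(x):
    with tf.device("/GPU:0"):
        h = tf.nn.relu(tf.matmul(x, w1))   # stage 1
    with tf.device("/GPU:1"):
        return tf.matmul(h, w2)            # stage 2

logits = forward(tf.random.normal([32, 784]))
```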
Device placement, meaning putting the right neural network (or the portion of the code that embodies it) on the right device at the right time for maximum throughput in the overall application, is particularly important as neural network models get bigger than the memory space and the compute oomph of a single CPU, GPU, or TPU. And the problem is getting worse faster than the frameworks and hardware can keep up. Take a look:
The number of parameters just keeps growing, and the number of devices being used in parallel also keeps growing. In fact, getting 128 GPUs or 128 TPUv3 processors (which is how you get the 512 cores in the chart above) to work in concert is quite an accomplishment, on par with the best that supercomputers could do more than two decades ago, back when federated NUMA servers with actual shared memory were the norm in HPC, before loosely coupled, massively parallel supercomputers using MPI took over. As more and more devices are lashed together in some fashion to handle these models, Google has been experimenting with using reinforcement learning (RL), a special subset of machine learning, to figure out where to best run neural network models at any given time as model ensembles are running on a collection of CPUs and GPUs. In this case, an initial policy is set for dispatching neural network models for processing, and the results are then fed back into the model for further adaptation, moving it toward more and more efficient running of those models.
In 2017, Google trained an RL model to do this work (you can see the paper here), and here is what the resulting placement looked like for the encoder and decoder: the RL model that placed the work on the two CPUs and four GPUs in the system under test ended up with 19.3 percent lower runtime for the training runs compared to the manually placed neural networks done by a human expert. Dean added that this RL-based placement of neural network work on the compute engines does kind of non-intuitive things to achieve that result, which is what seems to be the case with a lot of machine learning applications that, nonetheless, work as well as or better than humans doing the same tasks. The issue is that it can't take a lot of RL compute oomph to place the work on the devices that run the neural networks being trained. In 2018, Google did research to show how to scale computational graphs to over 80,000 operations (nodes), and last year Google created what it calls a generalized device placement scheme for dataflow graphs with over 50,000 operations (nodes).
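In spirit, that 2017 approach looks like the toy REINFORCE loop sketched below: a stochastic policy assigns each operation to a device, the runtime of the resulting placement becomes the (negative) reward, and the policy is nudged toward faster placements. The chain-graph cost model, the sizes, and the learning rate here are all invented stand-ins; the real system used a sequence-to-sequence policy network and measured runtimes as the reward signal:

```python
import numpy as np

rng = np.random.default_rng(0)
N_OPS, N_DEV = 8, 2
logits = np.zeros((N_OPS, N_DEV))   # one categorical distribution per op
baseline = 0.0                      # running reward baseline

def fake_runtime(placement):
    # toy cost model: load imbalance plus cross-device traffic on a chain graph
    load = np.bincount(placement, minlength=N_DEV)
    comm = np.sum(placement[:-1] != placement[1:])
    return load.max() + 0.5 * comm

for step in range(2000):
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(N_DEV, p=p) for p in probs])
    reward = -fake_runtime(placement)         # faster placement, higher reward
    advantage = reward - baseline
    baseline += 0.1 * (reward - baseline)
    for i, a in enumerate(placement):
        grad = -probs[i].copy()               # gradient of log-softmax...
        grad[a] += 1.0                        # ...at the sampled action
        logits[i] += 0.05 * advantage * grad  # REINFORCE update
```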
"Then we started to think about using this, and instead of using it to place software computation on different computational devices, we started to think about whether we could use it to do placement and routing in ASIC chip design, because the problems, if you squint at them, sort of look similar," says Dean. "Reinforcement learning works really well for hard problems with clear rules like chess or Go, and essentially we started asking ourselves: Can we get a reinforcement learning model to successfully play the game of ASIC chip layout?"
There are a couple of challenges to doing this, according to Dean. For one thing, chess and Go both have a single objective, which is to win the game and not lose the game. (They are two sides of the same coin.) With the placement of IP blocks on an ASIC and the routing between them, there is not a simple win or lose and there are many objectives that you care about, such as area, timing, congestion, design rules, and so on. Even more daunting is the fact that the number of potential states that have to be managed by the neural network model for IP block placement is enormous, as this chart below shows:
Finally, the true reward function that drives the placement of IP blocks runs in EDA tools and takes many hours to evaluate.
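That is why cheap proxy rewards matter so much here. Below is a sketch of the kind of multi-objective proxy the talk gestures at, combining a standard half-perimeter wirelength (HPWL) estimate with a crude density penalty; the weights and the particular component functions are our own placeholders, not whatever Google actually blends:

```python
import numpy as np

def hpwl(pos, nets):
    # half-perimeter wirelength: each net costs the width plus the height
    # of the bounding box around its pins
    return sum(np.ptp(pos[net, 0]) + np.ptp(pos[net, 1]) for net in nets)

def density_penalty(pos, grid=8):
    # how unevenly the blocks fill a grid laid over the unit-square canvas
    cells, _, _ = np.histogram2d(pos[:, 0], pos[:, 1],
                                 bins=grid, range=[[0, 1], [0, 1]])
    return cells.std()

def proxy_reward(pos, nets, w_wl=1.0, w_den=0.3):
    # higher is better, so negate the weighted cost
    return -(w_wl * hpwl(pos, nets) + w_den * density_penalty(pos))

# toy usage: five blocks at random canvas positions, two three-pin nets
pos = np.random.rand(5, 2)
nets = [np.array([0, 1, 2]), np.array([2, 3, 4])]
print(proxy_reward(pos, nets))
```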
"And so we have an architecture (I'm not going to go into a lot of detail) but essentially it tries to take a bunch of things that make up a chip design and then tries to place them on the wafer," explains Dean, and he showed off some results of placing IP blocks on a low-powered machine learning accelerator chip (we presume this is the edge TPU that Google has created for its smartphones), with some areas intentionally blurred to keep us from learning the details of that chip. "We have had a team of human experts place this IP block, and they had a couple of proxy reward functions that are very cheap for us to evaluate; we evaluated them in two seconds instead of hours, which is really important because reinforcement learning is one where you iterate many times. So we have a machine learning-based placement system, and what you can see is that it sort of spreads out the logic a bit more rather than having it in quite such a rectangular area, and that has enabled it to get improvements in both congestion and wire length. And we have got comparable or superhuman results on all the different IP blocks that we have tried so far."
Note: I am not sure we want to call AI algorithms superhuman. At least not if you don't want to have them banned.
Anyway, here is how that low-powered machine learning accelerator chip looked with the RL network versus people doing the IP block placement:
And here is a table that shows the difference between doing the placing and routing by hand and automating it with machine learning:
And finally, here is how the IP block on the TPU chip was handled by the RL network compared to the humans:
Look at how organic these AI-created IP blocks look compared to the Cartesian ones designed by humans. Fascinating.
Now, having done this, Google asked another question: Can we train a general agent that is quickly effective at placing a new design that it has never seen before? Which is precisely the point when you are making a new chip. So Google tested this generalized model against four different IP blocks from the TPU architecture and then also on the Ariane RISC-V processor architecture. This data pits people working with commercial tools against various levels of tuning on the model:
And here is some more data on the placement and routing done on the Ariane RISC-V chips:
"You can see that experience on other designs actually improves the results significantly, so essentially in twelve hours you can get the darkest blue bar," Dean says, referring to the first chart above, and then continues with the second chart above: "And this graph shows the wirelength costs, where we see that if you train from scratch, it actually takes the system a little while before it sort of makes some breakthrough insight and is able to significantly drop the wiring cost, whereas the pretrained policy has some general intuitions about chip design from seeing other designs and gets to that level very quickly."
Just as we do ensembles of simulations to do better weather forecasting, Dean says that this kind of AI-juiced placement and routing of IP blocks in chip design could be used to quickly generate many different layouts, with different tradeoffs. And in the event that some feature needs to be added, the AI-juiced chip design game could redo a layout quickly, not taking months to do it.
And most importantly, this automated design assistance could radically drop the cost of creating new chips. These costs are going up exponentially. According to data we have seen (thanks to IT industry luminary and Arista Networks chairman and chief technology officer Andy Bechtolsheim), an advanced chip design using 16 nanometer processes cost an average of $106.3 million; shifting to 10 nanometers pushed that up to $174.4 million; the move to 7 nanometers cost $297.8 million; and projections for 5 nanometer chips are on the order of $542.2 million. Nearly half of that cost has been, and continues to be, for software. So we know where to target some of those costs, and machine learning can help.
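As a quick sanity check on those figures, each process shrink multiplies the design cost by roughly 1.6X to 1.8X:

```python
# Node-to-node growth in Bechtolsheim's design-cost figures ($ millions)
costs = {"16nm": 106.3, "10nm": 174.4, "7nm": 297.8, "5nm": 542.2}
nodes = list(costs)
for a, b in zip(nodes, nodes[1:]):
    print(f"{a} -> {b}: {costs[b] / costs[a]:.2f}X")
# 16nm -> 10nm: 1.64X, 10nm -> 7nm: 1.71X, 7nm -> 5nm: 1.82X
```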
The question is: will the chip design software makers embed AI and foster an explosion in chip designs that can truly be called Cambrian, and then make it up in volume like the rest of us have to do in our work? It will be interesting to see what happens here, and how research like that being done by Google will help.