What is a data center? Wikipedia defines it as "a complex set of facilities. It includes not only computer systems and other supporting equipment (such as communication and storage systems), but also redundant data communication connections, environmental control equipment, monitoring equipment, and various security devices." Today, as cloud computing spreads, data center construction keeps expanding, and new technologies emerge, data centers are becoming more and more complex. A large data center is often composed of many subsystems with different functions, and operating and maintaining it requires knowledge of every area involved, including hardware, networks, servers, storage, security, and the business itself.

When a data center is very large, it faces more technical challenges. Many issues that are not problems in a small environment or a small system become prominent at that scale. Doing a good job of operating and maintaining a large data center therefore takes a long time spent learning the technologies involved in every part of it. Only by understanding the data center as a whole can we formulate targeted operation and maintenance plans, develop monitoring and operation and maintenance software for specific needs, manage and monitor the entire data center efficiently, improve its operating efficiency, reduce failures, and keep raising the standard of operation and maintenance work.
A large data center often contains many small systems, and operation and maintenance work is carried out around these specific application systems. The work can be divided into six parts: basic operation and maintenance management, daily business operation and maintenance, network, server, storage, and security.
First, basic operation and maintenance management of a data center mainly covers hardware configuration management, maintainability optimization, monitoring, alarm handling, automated operation and maintenance, and the handling of network outages, power outages, and disaster recovery in the machine room. Hardware configuration management means recording the model and hardware configuration of each server in each cabinet and knowing which business systems use those servers. Even in a virtualized environment, it is necessary to know which physical machines are in the resource pool. Because the number of physical and virtual machines in a data center is huge, automated operation and maintenance is essential. Automation not only improves efficiency but also reduces manual involvement, letting the data center largely manage itself and freeing up staff. Possible failures must also be monitored and alarmed so that a problem is known as soon as it occurs. A major failure often starts as a small fault that gradually spreads until the entire system collapses, so small anomalies must be eliminated promptly, and detecting them depends on a sound monitoring and alarm system.
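As a rough illustration of hardware configuration management, the sketch below keeps a small inventory that links each physical machine to the business systems and virtual machines that depend on it. The class name, field names, host names, and sample data are all hypothetical; a real data center would normally keep this in a configuration management database rather than in a script.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PhysicalServer:
    """One physical machine in a cabinet (all fields are illustrative)."""
    hostname: str
    cabinet: str
    model: str
    # Business systems running directly on this host.
    business_systems: set[str] = field(default_factory=set)
    # Virtual machines hosted here, mapped to the business system each serves.
    vms: dict[str, str] = field(default_factory=dict)


def affected_systems(inventory: list[PhysicalServer], hostname: str) -> set[str]:
    """Return every business system that depends on the given physical host."""
    for server in inventory:
        if server.hostname == hostname:
            return server.business_systems | set(server.vms.values())
    raise KeyError(f"unknown host: {hostname}")


# Made-up sample inventory for two hosts in the same cabinet.
inventory = [
    PhysicalServer("node-01", "A-03", "model-x", {"billing"}, {"vm-101": "search"}),
    PhysicalServer("node-02", "A-03", "model-x", vms={"vm-102": "search", "vm-103": "mail"}),
]

print(sorted(affected_systems(inventory, "node-02")))  # ['mail', 'search']
```

With a record like this, an alarm on one physical machine can be translated immediately into the list of business systems at risk.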
Daily business operation and maintenance of the data center mainly covers daily inspections, application changes, software and hardware upgrades, and sudden failures. Specifically:
1. Daily inspection: As the saying goes, "a thousand-mile embankment can collapse because of a single ant nest." Most failures show warning signs before they occur, and a small hidden danger that is not eliminated may lead to a major failure. Routine inspections of all running equipment should be planned according to the importance of the services the data center carries. Check whether the application services on each server are running normally and whether CPU and memory utilization are within normal ranges. The machine room environment should also be checked to see whether temperature, humidity, and dust levels meet requirements. Whether the air conditioning and power supply systems are running well and whether equipment is overheating are part of the inspection, as are the floor, skylights, fire protection, and surveillance. Leaks from air conditioners or other equipment threaten the stable operation of the data center.
2. Application changes: The business carried by the data center is not static. As the business diversifies and develops, it often needs adjustments, including changes to server and network settings. Maintenance staff therefore need to be familiar with operating servers and network equipment, which mainly means mastering Linux server commands and network protocols, so that changes can be made promptly and accurately as the application requires.
3. Software and hardware upgrades: The typical service life of data center equipment is about five years, so there is always equipment that needs to be phased out and replaced, and some equipment needs upgrading because of software defects. Software and hardware upgrades are therefore part of maintenance work. Every upgrade needs a rollback plan, so that a failed upgrade does not leave the business unrecoverable for a long time. After taking over maintenance of a data center, you will find that upgrades come almost every month, and staying up late to perform them becomes routine for maintenance personnel.
4. Sudden failure: No data center is free from failure, and problems will occur during operation. When a sudden failure occurs, experienced maintenance personnel can calmly analyze the cause and quickly find a solution. If no solution can be found in a short time, they can switch to backup equipment to restore the business first and analyze afterward. Skilled maintenance personnel are therefore crucial to a data center at critical moments.
Although these tasks seem ordinary, they should not be underestimated. Daily maintenance is very important to the normal operation of all the business in the data center. Only by taking maintenance seriously can the data center stay safe.
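The checks described under daily inspection (CPU and memory utilization, room temperature, humidity) can be sketched as a simple range check. The metric names and the acceptable ranges below are assumptions chosen only for illustration; real limits depend on the site and the equipment.

```python
# Hypothetical acceptable ranges; real limits depend on the site and equipment.
LIMITS = {
    "cpu_percent": (0, 85),
    "memory_percent": (0, 90),
    "room_temp_c": (18, 27),
    "humidity_percent": (40, 60),
}


def inspect(readings):
    """Return one finding for each reading that is missing or out of range."""
    findings = []
    for name, (low, high) in LIMITS.items():
        value = readings.get(name)
        if value is None:
            findings.append(f"{name}: no reading")
        elif not low <= value <= high:
            findings.append(f"{name}: {value} outside {low}-{high}")
    return findings


# Example inspection round with one bad value and one missing sensor.
report = inspect({"cpu_percent": 42.0, "memory_percent": 95.5, "room_temp_c": 24.0})
for line in report:
    print(line)
# memory_percent: 95.5 outside 0-90
# humidity_percent: no reading
```

Treating a missing reading as a finding is a deliberate choice here: a sensor that stops reporting is itself one of the small anomalies an inspection is meant to catch.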
From the perspective of the data center network, the work mainly covers network hardware, ACL, OSPF, LACP, VIP, protocol analysis, traffic, load balancing, layers 2, 3, 4, and 7, network monitoring, 10 Gigabit line cards, core switching, and so on. The network is an important part of the data center and the basic guarantee for everything that runs in it. Attention should go not only to network hardware but also to software-defined networking (SDN). In a traditional IT architecture, once the network has been deployed to meet business requirements, any change in those requirements means modifying the configuration on each affected network device (routers, switches, firewalls), which is very cumbersome. In the fast-changing Internet and mobile Internet business environment, high stability and high performance alone are not enough; flexibility and agility matter more. SDN separates the control function from the network devices and hands it to a centralized controller, which does not depend on the underlying devices and hides the differences between them. Control is fully open, so users can define whatever routing and forwarding rules and policies they want, making the network more flexible and intelligent. After a move to SDN, there is no need to configure the router at each node one by one; devices in the network are connected automatically, and only simple network rules need to be defined. If the router's built-in protocol does not suit you, you can modify it programmatically to achieve better data exchange performance. For example, Baidu's self-developed switch directly supports SDN remote configuration and management, so it can be configured fully automatically when it goes online.
In the future, self-developed switches will integrate further with server automation to improve server delivery and management efficiency. Networking is all-encompassing and involves a great many devices, protocols, and software-layer technologies, so continued study and a deepening understanding of network technology are needed to do network operation and maintenance well.
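The idea behind SDN, a central controller that defines the rules while the devices only forward, can be shown with a toy model. This is not a real SDN controller or protocol such as OpenFlow; the `Switch` and `Controller` classes, the switch names, and the prefixes are all invented for illustration.

```python
class Switch:
    """A forwarding device that only applies the rules it is given."""

    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # destination prefix -> output port

    def install(self, prefix, port):
        self.flow_table[prefix] = port


class Controller:
    """Central control plane: one place to define policy for every switch."""

    def __init__(self):
        self.switches = {}

    def register(self, switch):
        self.switches[switch.name] = switch

    def apply_policy(self, policy):
        """policy maps switch name -> {prefix: port}; push it to each device."""
        for name, rules in policy.items():
            for prefix, port in rules.items():
                self.switches[name].install(prefix, port)


controller = Controller()
for name in ("leaf-1", "leaf-2"):
    controller.register(Switch(name))

# One policy definition replaces logging in to each device separately.
controller.apply_policy({
    "leaf-1": {"10.0.1.0/24": 1, "10.0.2.0/24": 2},
    "leaf-2": {"10.0.1.0/24": 2},
})
print(controller.switches["leaf-1"].flow_table)
# {'10.0.1.0/24': 1, '10.0.2.0/24': 2}
```

When business requirements change, only the policy passed to the controller changes; the devices themselves are never configured by hand.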
From the perspective of data center servers, the work mainly covers file systems, kernel parameter tuning, drivers for various hard disks, kernel versions, kernel panics, and so on. Linux holds a mainstream position not only on servers but also among network operating systems. Besides being familiar with operating Linux systems, maintenance staff must monitor and manage the running state of servers and kernels to reduce server failures. A large data center generally contains thousands of servers, and some of them will have problems almost every day. To keep server failures from interrupting the business, virtualization or clustering technology is generally deployed on servers. These technologies make operation and maintenance harder and in turn require continued in-depth study. Customizing data center servers is also very worthwhile. Cloud computing is deployed on a large scale, so its servers need higher deployment density, lower energy use, and easier management, while the computing power demanded of each node is not very high. Ordinary servers from manufacturers have to suit a wide variety of applications, so they favor performance and scalability and neglect cost and energy consumption. A server customized for the cloud is optimized for the characteristics of the cloud and so meets users' needs better. For enterprises the benefit is clear: even if the power saved by each customized server is limited (for example, two power supplies instead of four), across a large-scale deployment the cost savings are significant in the long run.
For example, Google designs its own servers, using customized trays and built-in batteries as backup power. Their cost and power consumption are much lower than those of traditional servers, which saves Google a great deal in power costs.
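To see why a small per-server saving matters at scale, the arithmetic can be written out. The fleet size, the watts saved per server, and the electricity price below are made-up illustration values, not figures from any real data center.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_savings(servers, watts_saved_per_server, price_per_kwh):
    """Yearly electricity cost saved across a fleet of always-on servers."""
    kwh_saved = servers * watts_saved_per_server * HOURS_PER_YEAR / 1000
    return kwh_saved * price_per_kwh


# All three inputs are hypothetical illustration values, not measurements.
saving = annual_power_savings(servers=10_000, watts_saved_per_server=20, price_per_kwh=0.10)
print(f"{saving:,.0f} per year")  # 175,200 per year
```

Under these assumed inputs, saving only 20 W per machine adds up to about 175,200 currency units a year, before counting the reduced cooling load.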
From the perspective of data center storage, the architecture is more diverse and complex. Since cloud computing, virtualization, big data, and related technologies entered the data center, storage has changed enormously, and block storage, file storage, and object storage now support access to many types of data. Centralized storage is no longer the mainstream architecture in data centers; storing and accessing massive data requires a distributed storage architecture with strong scalability. For large-scale systems, distributed file systems, distributed object storage, and similar technologies give storage applications highly scalable, elastic support and strong data access performance. Of course, distributed storage is not meant to replace existing disk arrays but to cope with the rapid growth of data volume and bandwidth. Another trend is software-defined storage, which separates software from hardware in the storage architecture, that is, separates the control layer from the data layer. For data center users, managing and scheduling storage resources through software, with those resources virtualized, abstracted, and automated, can fully meet the needs of deploying, managing, monitoring, and adjusting the storage system, giving it flexibility, freedom, and high availability. Enterprise and Internet data is growing at a rate of 50% per year, and only a limited share of the new data is structured; most of it is unstructured or semi-structured.
How to store this large volume of disorganized data, process it in depth, and quickly extract valuable information to support business decisions will become fundamental to the survival of all kinds of enterprises. It is also the future direction of storage and of the business built around storage architecture.
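The 50% annual growth rate compounds quickly, which is why scalability dominates storage planning. The sketch below projects data volume under that rate; the starting volume of 100 TB and the capacity of 1000 TB are hypothetical values used only to make the example concrete.

```python
def projected_volume(initial_tb, years, annual_growth=0.5):
    """Data volume after compounding growth for a number of whole years."""
    return initial_tb * (1 + annual_growth) ** years


def years_until_full(initial_tb, capacity_tb, annual_growth=0.5):
    """Whole years until the projected volume first exceeds the capacity."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    years = 0
    volume = initial_tb
    while volume <= capacity_tb:
        volume *= 1 + annual_growth
        years += 1
    return years


# Hypothetical starting point: 100 TB today, 1000 TB of installed capacity.
print(projected_volume(100, 5))       # 759.375
print(years_until_full(100, 1000))    # 6
```

At this rate, data volume grows more than sevenfold in five years, so a system sized at ten times today's data would be outgrown in the sixth year.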
Finally, from the perspective of data center security, the work consists of a number of smaller items: attack protection, upgrades and backups, finding bugs, scripting tools, data security, service inspection, and so on, each of which covers a lot of ground. Attack protection, for example, mainly means defending the data center against both malicious and unintentional attacks. A malicious attack is one in which an outside intruder deliberately uses various attack methods to break into the data center and steal or destroy important data for ulterior motives. Unintentional attacks also occur. Because the data center stays connected to the outside world and its operation is dynamic and changing, some abnormal traffic will inevitably hit it, sometimes even from inside, for example from virus-infected servers, hardware failures, network loops, and other network faults. All of these affect the operation of the data center, so attack protection is a big topic. It is not solved by deploying a few security devices; the entire data center needs comprehensive, unified security planning, with protection measures deployed where they are needed. As hacking techniques improve, protection measures must improve too. This is a process of continuous learning and improvement that does not stop as long as the data center is running. To make operation and maintenance easier, it is also necessary to prepare executable scripts so that problems can be dealt with quickly in an emergency.
For example, if a service in one data center becomes abnormal, restoring it quickly requires adjusting routes on the core router to direct all traffic to other data centers. The data center should also keep scripts ready for many other tasks, for quick use in an emergency.
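A prepared emergency script for this case might start by working out where the traffic should go. The sketch below only computes the new split of traffic across the remaining data centers in proportion to their current weights; the site names and weights are hypothetical, and applying the result to a core router is deliberately left out because it depends on the equipment in use.

```python
def redirect_plan(traffic_weights, failed):
    """Spread all traffic over the healthy sites, in proportion to their weights."""
    if failed not in traffic_weights:
        raise KeyError(f"unknown data center: {failed}")
    healthy = {dc: w for dc, w in traffic_weights.items() if dc != failed}
    total = sum(healthy.values())
    if total == 0:
        raise RuntimeError("no healthy data center can take the traffic")
    return {dc: w / total for dc, w in healthy.items()}


# Hypothetical weights: dc-a normally carries half of all traffic.
weights = {"dc-a": 50, "dc-b": 30, "dc-c": 20}
print(redirect_plan(weights, failed="dc-a"))
# {'dc-b': 0.6, 'dc-c': 0.4}
```

Keeping the calculation separate from the step that changes router configuration makes the plan easy to review and rehearse before an emergency.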
After the analysis above, you may be surprised at how much data center operation and maintenance includes. There are many items, large and small, and none of them is simple; each involves a good deal of technical knowledge. A data center is usually the information processing center of a company, enterprise, or government department, and almost all of its business must pass through the data center, so the data center is very important to the organization. Whether a data center can run stably and efficiently depends above all on operation and maintenance. Only by doing every aspect of that work well can the data center remain stable over the long term.